STA2100-Regression Analysis
Chapter 9
Relations (Simple Linear Regression)
Learning outcomes
Upon completing this topic, you should be able to:
• Define correlation and regression, and explain the link between these two concepts
• Calculate the equations of least squares regression lines, and use them to
estimate values of one variable from given values of the other
• Interpret the meaning of the values obtained in the regression equation, and
how the values link to the correlation coefficient.
STA 2100 Probability and Statistics I
1. Introduction to Regression
As indicated in the introductory section of the previous lesson, this lesson looks
at regression, and specifically at simple linear regression.
If two variables are significantly correlated, and if there is some theoretical basis
for doing so, it is possible to predict values of one variable from the other.
Regression analysis, in a general sense, means the estimation or prediction of the
unknown value of one variable from the known value of the other variable. It is
one of the most important statistical tools and is used extensively in almost all
sciences: natural, social and physical.
“Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data.”
Definitions
Predictions
One of the primary advantages of knowing about a relationship between two
variables is that one can use this knowledge to make predictions. Specifically,
when one knows an individual’s score on one of the two variables, one can use
the relationship to increase the accuracy of a prediction of that individual’s
score on the other variable. By the term prediction we mean a “best guess” of
what a single value or score will be. For example, the value to be predicted
could be the number of years that a 62-year-old woman with high blood pressure
will live, or the time an individual machine will run before it breaks down.
Therefore, a prediction is a guess about the value of a term to be drawn from
a specified population.
A predictor variable is one that provides relevant information for predicting what
scores will be on some other variable.
A predicted variable is one about which predictions are made. (An exact
relationship between the predictor and the predicted variable is essential if we
are to make the most accurate predictions possible.)
The simplest form of regression analysis is called simple linear regression, or
straight-line regression. It involves statistical modelling of the relationship
between a single input factor X (the “regressor”) and a single output variable Y
(the “response”).
• The plot of a line with slope 2 and intercept 1 is depicted in the following
figure:
• Even though the straight-line model is perfect for algebra class, in the real
world finding such a perfect linear relationship is next to impossible.
• Most real world relationships are not perfectly linear models, but imperfect
models where the relationship between x and y is more like the correlation plot
we saw earlier.
• In this case, the question literally is “Where do you draw the line?”.
Simple linear regression is the statistical technique used to answer this
question.
• Simple linear regression is a statistical model of the relationship between X
and Y in the real world, where there is random variation associated with
measured quantities. The simplest relationship between X and Y is a straight
line, as opposed to a more complex relationship such as a polynomial.
• Therefore in most cases we want to try to fit the data to a linear model.
• The second step, once it is determined that a linear model is a good idea,
is to determine the best fitting line that represents the relationship.
• For any given value of X, the true mean value of Y depends on X and is
written $\mu_{y|x}$.
• In regression, the line represents mean values of Y, not individual data values.
Each observation $Y_i$ is independent of every other observation $Y_j$, $j \neq i$.
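One common way to write this idea down, given here only as a sketch, is the simple linear regression model. The symbols $\alpha$, $\beta$ and $\varepsilon_i$ are assumptions introduced for illustration and are not used elsewhere in this lesson: $\alpha$ and $\beta$ stand for the unknown population intercept and slope, and $\varepsilon_i$ for the random deviation of the $i$-th observation from the line.

```latex
Y_i = \alpha + \beta x_i + \varepsilon_i, \qquad E(\varepsilon_i) = 0,
\qquad \text{so that} \qquad \mu_{y|x} = E(Y \mid X = x) = \alpha + \beta x .
```

Read this way, the straight line describes how the mean of Y changes with x, while the individual observations scatter around it.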
[Figure: scatter plot of gene2 against gene1]
• This method is subject to much error, and it is unlikely to produce the
“best fitting” line. Therefore a more sophisticated method is needed.
• Regression analysis can be thought of as the flip side of correlation. It is
concerned with finding the equation of the kind of straight line we have
just looked at.
• Suppose we have a sample of size n with two sets of measures, denoted
by x and y. We can predict the values of y given the values of x by using
the equation $y^* = a + bx$, where
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
For a we have
$$a = \bar{y} - b\bar{x}$$
or rewritten as
$$a = \frac{\sum y - b\sum x}{n}$$
The symbol $y^*$ refers to the value of y predicted by the regression equation
for a given value of x.
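The two formulas can be turned into a few lines of Python. The following is a minimal sketch, not a reference implementation: the function name `least_squares_line` and the small data set are made up for illustration, and the data are chosen to lie exactly on a line with intercept 1 and slope 2 so that the result is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y* = a + b*x (hypothetical helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least two points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / denom
    # Intercept: a = (Sy - b*Sx) / n
    a = (sum_y - b * sum_x) / n
    return a, b


# Illustrative data lying exactly on y = 1 + 2x.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because the points lie exactly on a line, the formulas recover that line; with real data the points scatter and the same code returns the best-fitting line instead.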
• Suppose we have the linear equation y = 25 + 20x, which gives the total
cost, y, of a word processing job. Given the amount of time required, x, we
can use the equation to determine the exact cost of the job, y.
• However, things are rarely as simple as in this word processing example,
so more often than not we have to be content with rough predictions.
In fact, in many circumstances the variable being predicted will vary even
for a fixed value of the variable being used to make the prediction.
• For instance, we cannot predict the exact price of a Datsun Z just by
knowing its age. Indeed, even for a fixed age, say three (3) years old, the
price of a Datsun Z varies from car to car.
Example. Suppose we have the following data on Age vs Price of Datsun Zs.
Age (yrs)       5    7    6    6    5    4    7    6    5    5    2
Price ($100)   80   57   58   55   70   88   43   60   69   63  118
It’s useful to plot the data so that we can visualize the apparent relationship
between Age and Price. Such a plot is known as a scatter diagram.
From the diagram, it’s clear that the points are not on a straight line, but it’s
apparent that they are clustered about a straight line. Hence, we fit a straight line
to the data, and then use that line to predict the price of Datsun Zs.
Since it is possible to draw many reasonable-looking straight lines through the
cluster of points, we need a method to choose the “best” line. The method used
is known as the least-squares criterion.
So how does it work?
A simple illustration:
Suppose we have two lines, A and B, drawn through a set of points in a scatter
diagram, say:
Line A: y = 0.5 + 1.25x
Line B: y = −0.25 + 1.5x
Then we have the following predicted values and errors for the two lines:
x    y    $\hat{y}_A$    $e_A = y - \hat{y}_A$    $e_A^2$     $\hat{y}_B$    $e_B = y - \hat{y}_B$    $e_B^2$
0    2    0.5     1.5     2.25       −0.25    2.25    5.0625
1    4    1.75    2.25    5.0625     1.25     2.75    7.5625
2    6    3       3       9.00       2.75     3.25    10.5625
3    8    4.25    3.75    14.0625    4.25     3.75    14.0625
$\sum e_A^2 = 30.375$, $\sum e_B^2 = 37.25$
Where,
x is the observed value of x
y is the observed value of y
eA is the error made if we use line A for prediction
eB is the error made if we use line B for prediction
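The figures in the table above can be checked with a short Python sketch. The helper name `sum_squared_errors` is an assumption made for illustration; the data points and the two lines are the ones given in the illustration.

```python
# Observed points from the illustration.
xs = [0, 1, 2, 3]
ys = [2, 4, 6, 8]


def sum_squared_errors(intercept, slope, xs, ys):
    """Sum of squared errors e^2 = (y - y_hat)^2 for one candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


sse_a = sum_squared_errors(0.5, 1.25, xs, ys)   # Line A: y = 0.5 + 1.25x
sse_b = sum_squared_errors(-0.25, 1.5, xs, ys)  # Line B: y = -0.25 + 1.5x
print(sse_a, sse_b)  # 30.375 37.25

# The least-squares criterion prefers the line with the smaller total.
better = "A" if sse_a < sse_b else "B"
print(better)  # A
```

Note that this only compares the two candidate lines; it does not show that line A is the best of all possible lines.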
• The rule for choosing the best line among several possible lines is that we
choose the line with the smallest value of $\sum e^2$. This line gives the best
fit for the data at hand. Applying the rule directly is not easy, since we would
have to draw and compare every possible line, which is not feasible. To solve
this problem, the slope b and intercept a of the best line are computed directly
from formulas.
• Hence, from the above example of lines A and B, we would choose line A,
as it has the smaller sum of squared errors, i.e. it is the line of best fit for the
data if we were to consider only these two lines.
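The slope and intercept of the least-squares line are
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$
$$a = \bar{y} - b\bar{x}$$
or rewritten as
$$a = \frac{\sum y - b\sum x}{n}$$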
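For readers who want to see where these formulas come from, the following is a sketch of the standard least-squares derivation. The function $S(a, b)$ is notation introduced here for the sum of squared errors and is not used elsewhere in this lesson.

```latex
\begin{gather*}
S(a,b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} = -2\sum (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum y = na + b\sum x \\
\frac{\partial S}{\partial b} = -2\sum x_i (y_i - a - b x_i) = 0
  \;\Longrightarrow\; \sum xy = a\sum x + b\sum x^2 \\
a = \frac{\sum y - b\sum x}{n}, \qquad
b = \frac{n\sum xy - \sum x\sum y}{n\sum x^2 - \left(\sum x\right)^2}
\end{gather*}
```

The first equation gives a directly. Substituting that expression for a into the second equation and multiplying through by n gives b.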
We can now derive the line of best fit for the Datsun Z example and answer the
following questions:
Example. Refer to Age Vs Price data for the Datsun Z’s:
1. Determine the regression equation for the data; i.e. find the equation of the
regression line.
2. Describe the apparent relationship between Age and price for Datsun Zs
3. What does the slope of the regression equation represent in terms of the
prices for Datsun Zs?
4. Use the regression equation to predict the price for a two year-old Z and a
five-year old Z.
Solution
x        y       xy      $x^2$
5       80      400      25
7       57      399      49
6       58      348      36
6       55      330      36
5       70      350      25
4       88      352      16
7       43      301      49
6       60      360      36
5       69      345      25
5       63      315      25
2      118      236       4
Total:
58     761    3,736     326
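As a check on the column totals, and on the slope of −13.70 quoted in the solution, the computation can be sketched in Python. The variable names and the helper `predict_price` are assumptions made for illustration; the data are the 11 age and price pairs from the example, with price in hundreds of dollars.

```python
# Age (years) and price (in $100) for the 11 Datsun Zs in the example.
ages = [5, 7, 6, 6, 5, 4, 7, 6, 5, 5, 2]
prices = [80, 57, 58, 55, 70, 88, 43, 60, 69, 63, 118]

n = len(ages)
sum_x = sum(ages)
sum_y = sum(prices)
sum_xy = sum(x * y for x, y in zip(ages, prices))
sum_x2 = sum(x * x for x in ages)

# These reproduce the totals row of the table.
print(sum_x, sum_y, sum_xy, sum_x2)  # 58 761 3736 326

# Slope and intercept from the least-squares formulas.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
print(round(a, 2), round(b, 2))  # 141.43 -13.7


def predict_price(age):
    """Predicted price in hundreds of dollars (hypothetical helper)."""
    return a + b * age


for age in (2, 5):
    print(age, round(predict_price(age), 2))  # 2 114.03, then 5 72.92
```

Under these figures the fitted line is approximately $y^* = 141.43 - 13.70x$, so a two-year-old Z is predicted at about $11,403 and a five-year-old Z at about $7,292.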
2. Here, we are to describe the apparent relationship between age and price for
Datsun Zs. Since the slope of the regression line is negative, we see that
the price tends to decrease as age increases. Any surprises?
3. For this part, we are to interpret the slope of the regression equation in
terms of the prices for Datsun Zs. To begin, recall that x represents age, in
years, and y represents price, in hundreds of dollars. The slope of −13.70, or
−$1,370, indicates that Datsun Zs depreciate by an estimated $1,370 per year,
at least in the two- to seven-year-old range.
Remark 4. If you plan to find a regression line for a set of data points, first look at
a scatter diagram of the data. If the data points do not appear to be scattered about
a straight line, do not determine a regression line.
2. Exercises
Exercise 22. Scores made by students in a statistics class in the mid-term and
final examinations are given here. Develop a regression equation which may be used
to predict final examination scores from the mid-term score.
Student      1    2    3    4    5    6    7    8    9   10
Mid-term    98   66  100   96   88   45   76   60   74   82
Final       90   74   98   88   80   62   78   74   86   80
3. Revision Questions
The following is a list of questions that will assist you in your revision.
Practice Problems:
Subject 1 2 3 4 5
Hamburgers 5 4 3 2 1
Beers 8 10 4 6 2
2. A horse owner is investigating the relationship between weight carried and the finish
position of several horses in his stable. Calculate r and R for the data given
Weight 11 11 12 11 11 11 11 12 10 10 11 11
carried
Position 0
2 3
6 0
3 5
4 0
6 5 7
4 3
2 6
1 8
4 0
1 0
3
Finished
3. The top and bottom numbers which may appear on a die are as follows. Calculate r
and R for these values. Are the results surprising?
Top 1 2 3 4 5 6
Bottom 5 6 4 3 1 2
X 38 42 29 31 28 15 24 17 19 11 8 19 3 14 6
y 4 3 11 5 9 6 14 9 10 15 19 17 10 14 18
b) What does this statistic mean concerning the relationship between death
anxiety and religiosity?
c) What percent of the variability is accounted for by the relation of these two
variables?
5. The data given below are obtained from student records: Grade Point Average (x)
and Graduate Record Exam score (y). Calculate the regression equation and compute
the estimated GRE scores for GPA = 7.5 and 8.5.
Subject 11 12 13 14 15 16 17 18 19 20
X 8.3 8.6 9.2 9.8 8.0 7.8 9.4 9.0 7.2 8.6
y 2300 2250 2380 2400 2000 2100 2360 2350 2000 2260
6. A horse was tested to see how many minutes it takes to reach a point from
the starting point. The horse was made to carry luggage of various weights on 10
trials. The data collected are presented in the table below. Find the regression
equation between the load and the time taken to reach the goal. Estimate the time
taken for loads of 35 kg, 23 kg, and 9 kg. Are the answers in agreement with
your intuitive feelings? Justify.
Trial Number            1    2    3    4    5    6    7    8    9   10
Weight (in kg)         11   23   16   32   12   28   29   19   25   20
Time taken (in mins)   13   22   16   47   13   39   43   21   32   22
7. A study was conducted to find whether there is any relationship between the weight
and blood pressure of an individual. The following set of data was arrived at from a
clinical study.
Serial Number      1    2    3    4    5    6    7    8    9   10
Weight            78   86   72  822   80   86   84   89   68   71
Blood Pressure   140  160  134  144  180  176  174  178  128  132
8. It is assumed that achievement test scores should be correlated with students’
classroom performance. One would expect that students who consistently perform
well in the classroom (tests, quizzes, etc.) would also perform well on a standardized
achievement test (x), scored from 0 to 100 with 100 indicating high achievement. A
teacher decides to examine this hypothesis. At the end of the academic year, she
computes a correlation between the students’ achievement test scores (she purposefully
did not look at these data until after she had submitted the students’ grades) and the
overall G.P.A. (y) for each student, computed over the entire year. The data for her
class are provided below.
X 98 96 94 88 01 77 86 71 59 6 8 7 7 7 8 8 7 9 9 6
Y 3.6 2.7 3.1 4.0 3.2 3.0 3.8 2.6 3.0 3 4 9
2 1 5 2
3 2 6 3
2 2 5 1 3 3
2 3 0 2
1
. . . . . . . . . . .
a) Compute the correlation coefficient. 2 7 1 6 9 4 4 8 7 2 6
d) What would be the slope and y-intercept for a regression line based on this
data?
the nearest dollar) and degree of customer satisfaction (on a scale of 1 - 10 with a 1
being not at all satisfied and a 10 being extremely satisfied). The researcher only
includes programs with comparable types of services. A sample of the data is
provided below.
Dollars 11 18 17 15 9 5 12 19 22 25
Satisfaction 6 8 10 4 9 6 3 5 2 10
b) What does this statistic mean concerning the relationship between amount of
money spent per month on internet provider service and level of customer
satisfaction?
10. It is hypothesized that fluctuations in norepinephrine (NE) levels
accompany fluctuations in affect in bipolar affective disorder (manic-depressive
illness). Thus, during depressive states, NE levels drop; during manic states, NE
levels increase. To test this relationship, researchers measured the level of NE by
measuring the metabolite 3-methoxy-4-hydroxyphenylglycol (MHPG, in micrograms
per 24 hours) in the urine of patients experiencing varying levels of mania/depression.
Increased levels of MHPG are correlated with increased metabolism (thus higher
levels) of central nervous system NE. Levels of mania/depression were also recorded
on a scale with a low score indicating increased mania and a high score increased
depression. The data are provided below.
MHPG 980 1209 1403 1950 1814 1280 1073 1066 880 776
Affect 22 26 8 10 5 19 26 12 23 28
b) What does this statistic mean concerning the relationship between MHPG
levels and affect?
d) What would be the slope and y-intercept for a regression line based on this
data?
e) What would be the predicted affect score if the individual had an MHPG level
of 1100? of 950? of 700?
4. Learning Activities
1. Daniel computed the following statistics based on the amount (X) in millions
(Kshs) that he invested in his cyber café business, and the income (Y) in
millions (Kshs) generated.
$n = 10$, $\sum x_i = 93$, $\sum x_i^2 = 999$, $\sum x_i y_i = 293$, $\sum y_i = 28$, $\sum y_i^2 = 90$
• Using the data, fit a linear regression line of the income (y) generated on
the amount (x) invested.
• Use the regression equation to determine how much Daniel would realize if
he invested Kshs 2.5M and comment on your results.