STA2100-Regression Analysis

STA 2100 Probability and Statistics I

Chapter 9
Relations (Simple Linear Regression)

Learning outcomes
Upon completing this topic, you should be able to:

• Define correlation, regression and know the link between these two concepts

• Know the assumptions of linear regression

• Calculate the equations of least squares regression lines, and use them to
estimate any given values for a set of data

• Interpret the meaning of the values obtained in the regression equation, and
how the values link to the correlation coefficient.


1. Introduction to Regression
As indicated in the introductory section of the previous lesson, this lesson looks
at regression, and specifically simple linear regression.
If two variables are significantly correlated, and if there is some theoretical basis
for doing so, it is possible to predict values of one variable from the other.
Regression analysis, in a general sense, means the estimation or prediction of the
unknown value of one variable from the known value of another variable. It is
one of the most important statistical tools and is used extensively in almost all
sciences: natural, social and physical.
“Regression analysis is a mathematical measure of the average relationship
between two or more variables in terms of the original units of the data.”

Definitions

Regression analysis is a measure of the average relationship between two or
more variables in terms of the original units in which the data were given.
It provides a mathematical expression, an equation, for estimating or
predicting the values of one variable from the known values of one or more
other variables.

Predictions

One of the primary advantages of knowing about a relationship between two
variables is that the knowledge can be used to make predictions. Specifically,
when one has exact knowledge of an individual's score on one of the two
variables, the relationship can be used to increase the accuracy of a
prediction of that individual's score on the other variable. By the term
prediction we mean a “best guess” of what a single value or score will be,
e.g. the number of years that a 62-year-old woman with high blood pressure
will live, or the time an individual machine will run before it breaks down.
A prediction is therefore a guess about the value of an item to be drawn from
a specified population.

A predictor variable is one that provides relevant information for predicting what
scores will be on some other variable.

A predicted variable is one about which predictions are made. (An exact
relationship between the predictor and predicted variables is essential if we
are to make the most accurate predictions possible.)

Simple Regression: Here, we use a single variable to estimate or predict another
variable.

Linear Regression Analysis: A regression analysis is called linear if the equation
of the model represents a straight line.

Curvilinear Regression Analysis: A regression analysis is called curvilinear if
it represents a curve.

Multiple Regression: This is a multivariate regression; that is, it involves several
variables.

Independent Variable: This is the variable whose value is known.

Dependent Variable: This is the variable whose value is to be predicted. By
convention, the independent variable is denoted by X and the dependent
variable by Y.

The simplest form of regression analysis is called simple linear regression, or
straight line regression. It involves the statistical modelling of the relationship
between a single input factor X (the “regressor”) and a single output variable Y
(the “response”).

1.1. Simple Linear (least squares) Regression Model


• I believe that at some point in your life, you have encountered the equation
of a straight line, y = mx + b.

• In this equation m is the “slope” of the line (change in y over change in
x) and b is the “intercept” of the line, where the y-axis is intersected by the
line.

• The plot of a line with slope 2 and intercept 1 is depicted in the following
figure:


• Even though the straight-line model is perfect for algebra class, in the real
world finding such a perfect linear relationship is next to impossible.

• Most real world relationships are not perfectly linear models, but imperfect
models where the relationship between x and y is more like the correlation plot
we saw earlier.

• In this case, the question literally is “Where do you draw the line?”.
Simple linear regression is the statistical technique used to answer this
question.

• Simple linear regression is the statistical model between X and Y in the real
world, where there is random variation associated with measured variable
quantities. To study the relationship between X and Y, the simplest
relationship is that of a straight line, as opposed to a more complex
relationship such as a polynomial.

• Therefore in most cases we want to try to fit the data to a linear model.

• Plotting X versus Y, as we did in the correlation plot, is a good first step
to determine if a linear model is appropriate. This may reveal a lot of
details about the data at hand.

• Sometimes data can be transformed (often by taking logarithms, square roots
or other mathematical transformations) to fit a linear pattern if they do not do so
in the original measurement scale.

• Therefore, determining if a straight line relationship between X and Y is
appropriate is the first step.

• The second step, once it is determined that a linear model is a good idea,
is to determine the best fitting line that represents the relationship.
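
As a rough illustration of the transformation idea in the list above, the sketch below uses made-up data that follow an exponential curve, y = 2·exp(0.5x). The data and the helper name `consecutive_slopes` are assumptions chosen purely for illustration and do not come from these notes. On the original scale the slope between neighbouring points keeps growing, whereas after taking logarithms of y the slope is constant, which is what a straight line looks like.

```python
import math

# Hypothetical data following y = 2 * exp(0.5 * x); not taken from the notes.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]


def consecutive_slopes(xs, ys):
    """Slope between each pair of neighbouring points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]


# On the original scale the slopes keep increasing: not a straight line.
raw_slopes = consecutive_slopes(xs, ys)

# After a log transformation every slope is 0.5: a straight line.
log_ys = [math.log(y) for y in ys]
log_slopes = consecutive_slopes(xs, log_ys)

print([round(s, 3) for s in raw_slopes])
print([round(s, 3) for s in log_slopes])
```

With real data the transformed points will only be approximately linear, so a scatter plot of the transformed values is still the sensible check.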

1.2. Assumptions of linear regression


Generally, to fit a regression model, linear or otherwise, some assumptions have
to be made, and these include:

• For any given value of X, the true mean value of Y depends on X, which can
be written µy|x.

• In regression, the line represents mean values of Y, not individual data values.
Each observation Yi is independent of every other observation Yj, j ≠ i.

• We assume linearity between X and the mean of Y. The mean of Y is
determined by the straight line relationship, which can be written as µy|x =
β0 + β1 X or y∗ = a + bx, where β0 (or a) is the intercept and β1 (or b) is
the slope of the line.

• The variance of the Yi's is constant (homoscedasticity property).

• The responses Yi are normally distributed with a constant variance.

1.3. Fitting the Regression Model/Equation


• Consider the correlation plot between gene1 and gene2 as shown in the
following figure.

[Figure: scatter plot of gene2 (vertical axis) against gene1 (horizontal axis)]

• We can simply take a ruler and draw in a straight line which, according to
our subjective eyes, best goes through the data.

• This method is subject to much error, and it is unlikely that it will produce the
“best fitting” line. Therefore a more sophisticated method is needed.

• Regression analysis can be thought of as being sort of like the flip side of
correlation. It has to do with finding the equation for the kind of straight
lines we have just looked at.

• Suppose we have a sample of size n and it has two sets of measures, denoted
by x and y. We can predict the values of y given the values of x by using
the equation y∗ = a + bx, where

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

This can further be rearranged and expressed as

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

For a we have

$$a = \bar{y} - b\bar{x}$$

or rewritten as

$$a = \frac{\sum y - b\sum x}{n}$$

The symbol y∗ refers to the predicted value of y from a given value of x from
the regression equation.
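
The two expressions for b given above are algebraically equivalent. The short sketch below computes both on a small made-up data set to show that they agree. The function names and the data are hypothetical, chosen only for illustration.

```python
def slope_deviation_form(xs, ys):
    """b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def slope_computational_form(xs, ys):
    """b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


def intercept(xs, ys, b):
    """a = (sum(y) - b*sum(x)) / n, which equals y_bar - b*x_bar."""
    return (sum(ys) - b * sum(xs)) / len(xs)


# Hypothetical data, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b1 = slope_deviation_form(xs, ys)
b2 = slope_computational_form(xs, ys)
a = intercept(xs, ys, b1)
print(b1, b2, a)
```

The computational form is usually the more convenient one for hand calculation, because it needs only the column totals of x, y, xy and x².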


• Suppose we have the linear equation y = 25 + 20x, which gives the total
cost, y, of a word processing job. Given the amount of time required, x, we
can use the equation to determine the exact cost of the job, y.

• However, things are rarely as simple as in the word processing example,
so more often than not we have to be content with rough predictions.
In fact, in many circumstances the variable being predicted will vary even
for a fixed value of the variable being used to make the prediction.

• For instance, we cannot predict the exact price of a Datsun Z car just by
knowing its age. Indeed, even for a fixed age, say three (3) years old, the
price of a Datsun Z varies from car to car.

Example. Suppose we have the following data on Age vs Price of Datsun Zs.

Age (yrs)      5    7    6    6    5    4    7    6    5    5    2
Price ($100)   80   57   58   55   70   88   43   60   69   63   118

It is useful to plot the data so that we can visualize the apparent relationship
between age and price. Such a plot is known as a scatter diagram.


From the diagram, it is clear that the points are not on a straight line, but
they are apparently clustered about a straight line. Hence, we fit a straight line
to the data, and then we can use that line to predict the price of Datsun Zs.
Since it is possible to draw many reasonable-looking straight lines through the
cluster of points, we need a method to choose the “best” line. The method used
is known as the least-squares criterion.
So how does it work?
Simple illustration:
Suppose we have two lines A and B drawn for a set of points in a scatter
diagram, say:
Line A: y = 0.5 + 1.25x
Line B: y = −0.25 + 1.5x
Then we have the following predicted values and errors for the two lines:

x    y    ŷA      eA = y − ŷA    eA²        ŷB       eB = y − ŷB    eB²
0    2    0.5     1.5            2.25       −0.25    2.25           5.0625
1    4    1.75    2.25           5.0625     1.25     2.75           7.5625
2    6    3       3              9.00       2.75     3.25           10.5625
3    8    4.25    3.75           14.0625    4.25     3.75           14.0625

Σ eA² = 30.375        Σ eB² = 37.25

Where,
x is the observed value of x
y is the observed value of y
ŷA and ŷB are the values of y predicted by lines A and B
eA is the error made if we use line A for prediction
eB is the error made if we use line B for prediction
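
The comparison of lines A and B can be reproduced with a few lines of code. The following is a minimal sketch using the four data points and the two candidate lines from the illustration above; the function name `sum_squared_errors` is an assumption made for this example.

```python
# Observed points from the illustration.
xs = [0, 1, 2, 3]
ys = [2, 4, 6, 8]


def sum_squared_errors(intercept, slope, xs, ys):
    """Sum of squared differences between observed y and the line's prediction."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


sse_a = sum_squared_errors(0.5, 1.25, xs, ys)   # Line A: y = 0.5 + 1.25x
sse_b = sum_squared_errors(-0.25, 1.5, xs, ys)  # Line B: y = -0.25 + 1.5x

print(sse_a, sse_b)  # 30.375 37.25
```

Line A has the smaller sum of squared errors, so of these two candidates it is the better fit.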

• The rule for choosing the best line among several possible lines is that we
choose the line with the smallest value of Σe². This line will give the best
fit for the data at hand. This may not be an easy task, as we would be forced
to draw all the possible lines, which is not possible. To solve the
problem, we use the regression equation formulas as previously illustrated.
The line obtained using those formulas is the line with the least sum
of squared errors, hence the name least squares regression!

• Hence, from the above example of lines A and B, we would choose line A,
as it has the smaller sum of squared errors, i.e. it is the line of best fit for
the data if we were to consider only these two lines.

• The least-squares criterion tells us what property the best-fitting line to a
set of data points must have, but it does not by itself present a formula that
permits us to actually determine the best-fitting line. (Here
we just use the formulas; they are derived using elementary
calculus.)
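
For readers who want to see where the formulas come from, here is a sketch of the standard calculus argument, which these notes do not spell out. The symbol S for the sum of squared errors is notation introduced here for convenience. Writing

$$S(a, b) = \sum (y_i - a - b x_i)^2$$

and setting both partial derivatives to zero gives

$$\frac{\partial S}{\partial a} = -2\sum (y_i - a - b x_i) = 0, \qquad \frac{\partial S}{\partial b} = -2\sum x_i (y_i - a - b x_i) = 0$$

which rearrange to the two so-called normal equations

$$\sum y = na + b\sum x, \qquad \sum xy = a\sum x + b\sum x^2$$

Solving the first for a gives $a = (\sum y - b\sum x)/n$. Substituting this into the second and solving for b gives

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$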

1.4. Regression Equation:


As previously indicated, the equation of the best-fitting line (regression line) to a
set of data points is given by
y∗ = a + bx
where

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

and

$$a = \frac{\sum y - b\sum x}{n}$$

We can then derive the line of best fit for the Datsun Z cars example, and also
answer the following questions;
Example. Refer to Age Vs Price data for the Datsun Z’s:

1. Determine the regression equation for the data; i.e. find the equation of the
regression line.

2. Describe the apparent relationship between age and price for Datsun Zs.

3. What does the slope of the regression equation represent in terms of the
prices for Datsun Zs?


4. Use the regression equation to predict the price for a two year-old Z and a
five-year old Z.

Solution

1. To determine the regression equation, we need to compute a and b using the
formulas above. It is therefore convenient to construct a table of the values
x, y, xy and x², together with their sums Σx, Σy, Σxy and Σx² (here n = 11),
as presented below:

x     y      xy      x²
5     80     400     25
7     57     399     49
6     58     348     36
6     55     330     36
5     70     350     25
4     88     352     16
7     43     301     49
6     60     360     36
5     69     345     25
5     63     315     25
2     118    236     4
Σ 58  761    3,736   326

The slope of the regression equation is therefore:

b = [11(3736) − (58)(761)] / [11(326) − (58)²] = −3042/222 ≈ −13.70

While the intercept, using the unrounded slope, is:

a = [761 − (−13.7027)(58)] / 11 ≈ 141.43

Thus, the regression equation for this data is:

ŷ = 141.43 − 13.7x


2. To graph the regression equation, we need only substitute two different
x-values to obtain two distinct points (why?). Using the x-values x = 2 and
x = 8, the corresponding y-values are:

ŷ = 141.43 − 13.7(2) = 114.03

ŷ = 141.43 − 13.7(8) = 31.83

Consequently, the regression line passes through the two points (2, 114.03)
and (8, 31.83). The line through these points should be drawn on the scatter
diagram together with the data points given in the table. This is the straight
line that best fits the data points according to the least-squares criterion
(i.e. the straight line whose sum of squared errors is smallest).

3. Here, we are to describe the apparent relationship between age and price for
Datsun Zs. Since the slope of the regression line is negative, we see that
the price tends to decrease as age increases. Any surprises?

4. For this part, we are to interpret the slope of the regression equation in
terms of the prices for Datsun Zs. To begin, recall that x represents age, in
years, and y represents price, in hundreds of dollars. The slope of −13.70
indicates that Datsun Zs depreciate by an estimated $1,370 per year, at least
in the two- to seven-year-old range.

5. Finally, we are to use the regression equation ŷ = 141.43 − 13.7x to
predict the price for a two-year-old Z and a five-year-old Z, i.e. for x = 2 and
x = 5. The predicted price for a two-year-old Z is:

ŷ = 141.43 − 13.7(2) = 114.03

or $11,403. Similarly, the predicted price for a five-year-old Z is:

ŷ = 141.43 − 13.7(5) = 72.93

or $7,293.
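
The arithmetic in this worked example can be checked with a short script. The sketch below recomputes the column totals, the slope and the intercept from the Datsun Z data, and then the two predictions. The helper `predict` is a name introduced here, and it deliberately uses the coefficients rounded to two decimals so that its results match the hand calculation.

```python
# Age (years) and price (in $100) for the 11 Datsun Zs in the example.
ages = [5, 7, 6, 6, 5, 4, 7, 6, 5, 5, 2]
prices = [80, 57, 58, 55, 70, 88, 43, 60, 69, 63, 118]

n = len(ages)
sum_x = sum(ages)
sum_y = sum(prices)
sum_xy = sum(x * y for x, y in zip(ages, prices))
sum_x2 = sum(x * x for x in ages)

# Least-squares slope and intercept from the computational formulas.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n


def predict(age, intercept=round(a, 2), slope=round(b, 2)):
    """Predicted price in $100, using coefficients rounded as in the text."""
    return round(intercept + slope * age, 2)


print(n, sum_x, sum_y, sum_xy, sum_x2)  # 11 58 761 3736 326
print(round(b, 2), round(a, 2))         # -13.7 141.43
print(predict(2), predict(5))           # 114.03 72.93
```

Multiplying the predictions by 100 converts them to dollars, giving $11,403 and $7,293.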
Remark 3. Warning on use of the linear regression line: The idea behind the
regression line is based on the assumption that the data points are actually scattered
around a straight line. But data points can at times be scattered about a curve.
Unfortunately, the formulas for a and b will still work for such a data set, but
they fit an inappropriate regression line to the data.


Remark 4. If you plan to find a regression line for a set of data points, first look at
a scatter diagram of the data. If data points do not appear to be scattered about
a straight line, do not determine a regression line.
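
To make the warning in Remarks 3 and 4 concrete, the sketch below applies the formulas for a and b to made-up points that lie exactly on the curve y = x². The data are hypothetical and chosen only for illustration. The formulas happily return a line, but it is flat and its errors follow a clear pattern rather than scattering randomly.

```python
# Hypothetical data lying exactly on the curve y = x**2; not from the notes.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# The least-squares formulas still produce an answer.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Errors (observed minus predicted) for the fitted line.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(a, b)        # intercept 2.0, slope 0.0
print(residuals)   # positive at both ends, negative in the middle
```

A slope of zero would suggest that y does not depend on x at all, even though y is completely determined by x here. This is why the scatter diagram should be inspected first.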

2. Exercises

Exercise 22. Scores made by students in a statistics class in the mid-term and
final examinations are given here. Develop a regression equation which may be used
to predict final examination scores from the mid-term score.

Student    1    2    3    4    5    6    7    8    9    10
Mid-term   98   66   100  96   88   45   76   60   74   82
Final      90   74   98   88   80   62   78   74   86   80

3. Revision Questions
The following is a list of questions that will assist you in your revision.
Practice Problems:

1. Let variable X be the number of hamburgers consumed at a cook-out, and variable Y
the number of beers consumed. Develop a regression equation to predict how
many beers a person will consume, given that we know how many hamburgers that
person will consume.

Subject 1 2 3 4 5
Hamburgers 5 4 3 2 1
Beers 8 10 4 6 2

2. A horse owner is investigating the relationship between weight carried and the finish
position of several horses in his stable. Calculate r and R for the data given

Weight 11 11 12 11 11 11 11 12 10 10 11 11
carried
Position 0
2 3
6 0
3 5
4 0
6 5 7
4 3
2 6
1 8
4 0
1 0
3
Finished

3. The top and bottom numbers which may appear on a die are as follows. Calculate r
and R for these values. Are the results surprising?

Top 1 2 3 4 5 6

Bottom 5 6 4 3 1 2

4. Researchers interested in determining if there is a relationship between death
anxiety and religiosity conducted the following study. Subjects completed a death
anxiety scale (high score = high anxiety) and also completed a checklist designed to
measure an individual’s degree of religiosity (belief in a particular religion, regular
attendance at religious services, number of times per week they regularly pray, etc.;
high score = greater religiosity). A data sample is provided below:

X 38 42 29 31 28 15 24 17 19 11 8 19 3 14 6
y 4 3 11 5 9 6 14 9 10 15 19 17 10 14 18

a) What is your computed answer?

b) What does this statistic mean concerning the relationship between death
anxiety and religiosity?

c) What percent of the variability is accounted for by the relation of these two
variables?

5. The data given below are obtained from student records: Grade Point Average (x)
and Graduate Record Exam score (y). Calculate the regression equation and compute
the estimated GRE scores for GPA = 7.5 and 8.5.

Subject 11 12 13 14 15 16 17 18 19 20
X 8.3 8.6 9.2 9.8 8.0 7.8 9.4 9.0 7.2 8.6
y 2300 2250 2380 2400 2000 2100 2360 2350 2000 2260


6. A horse was tested to see how many minutes it takes to reach a point from
the starting point. The horse was made to carry luggage of various weights on 10
trials. The data collected are presented in the table below. Find the regression
equation between the load and the time taken to reach the goal. Estimate the time
taken for loads of 35 kg, 23 kg and 9 kg. Are the answers in agreement with
your intuitive feelings? Justify.

Trial number           1    2    3    4    5    6    7    8    9    10
Weight (in kg)         11   23   16   32   12   28   29   19   25   20
Time taken (in mins)   13   22   16   47   13   39   43   21   32   22

7. A study was conducted to find whether there is any relationship between the weight
and blood pressure of an individual. The following set of data was arrived at from a
clinical study.

Serial number    1    2    3    4    5    6    7    8    9    10
Weight           78   86   72   822  80   86   84   89   68   71
Blood pressure   140  160  134  144  180  176  174  178  128  132

8. It is assumed that achievement test scores should be correlated with students'
classroom performance. One would expect that students who consistently perform
well in the classroom (tests, quizzes, etc.) would also perform well on a standardized
achievement test (0 - 100 with 100 indicating high achievement (x)). A teacher
decides to examine this hypothesis. At the end of the academic year, she computes a
correlation between the students' achievement test scores (she purposefully did not
look at this data until after she submitted students' grades) and the overall G.P.A. (y)
for each student computed over the entire year. The data for her class are provided
below.

X 98 96 94 88 01 77 86 71 59 6 8 7 7 7 8 8 7 9 9 6
Y 3.6 2.7 3.1 4.0 3.2 3.0 3.8 2.6 3.0 3 4 9
2 1 5 2
3 2 6 3
2 2 5 1 3 3
2 3 0 2
1
. . . . . . . . . . .
2 7 1 6 9 4 4 8 7 2 6

a) Compute the correlation coefficient.

b) What does this statistic mean concerning the relationship between
achievement test performance and G.P.A.?

c) What percent of the variability is accounted for by the relationship between
the two variables and what does this statistic mean?

d) What would be the slope and y-intercept for a regression line based on this
data?

e) If a student scored a 93 on the achievement test, what would be their
predicted G.P.A.? If they scored a 74? An 88?

9. With the growth of internet service providers, a researcher decides to examine
whether there is a correlation between cost of internet service per month (rounded to
the nearest dollar) and degree of customer satisfaction (on a scale of 1 - 10 with a 1
being not at all satisfied and a 10 being extremely satisfied). The researcher only
includes programs with comparable types of services. A sample of the data is
provided below.

Dollars 11 18 17 15 9 5 12 19 22 25
Satisfaction 6 8 10 4 9 6 3 5 2 10

a) Compute the correlation coefficient.

b) What does this statistic mean concerning the relationship between amount of
money spent per month on internet provider service and level of customer
satisfaction?

c) What percent of the variability is accounted for by the relationship between
the two variables and what does this statistic mean?

10. It is hypothesized that there are fluctuations in norepinephrine (NE) levels which
accompany fluctuations in affect with bipolar affective disorder (manic-depressive
illness). Thus, during depressive states, NE levels drop; during manic states, NE
levels increase. To test this relationship, researchers measured the level of NE by
measuring the metabolite 3-methoxy-4-hydroxyphenylglycol (MHPG, in micrograms
per 24 hours) in the urine of patients experiencing varying levels of mania/depression.
Increased levels of MHPG are correlated with increased metabolism (thus higher
levels) of central nervous system NE. Levels of mania/depression were also recorded
on a scale with a low score indicating increased mania and a high score increased
depression. The data is provided below.

MHPG 980 1209 1403 1950 1814 1280 1073 1066 880 776

Affect 22 26 8 10 5 19 26 12 23 28

a) Compute the correlation coefficient.

b) What does this statistic mean concerning the relationship between MHPG
levels and affect?

c) What percent of the variability is accounted for by the relationship between
the two variables?

d) What would be the slope and y-intercept for a regression line based on this
data?

e) What would be the predicted affect score if the individual had an MHPG level
of 1100? of 950? of 700?


4. Learning Activities
1. Daniel computed the following statistics based on the amount (X) in millions
(Kshs) that he invested in his cyber café business, and the income (Y) in
millions (Kshs) generated.
$n = 10,\ \sum x_i = 93,\ \sum x_i^2 = 999,\ \sum x_i y_i = 293,\ \sum y_i = 28,\ \sum y_i^2 = 90$

• Using the data, fit a linear regression line of the income (y) generated on
the amount (x) invested.

• Use the regression equation to determine how much Daniel would realize if
he invested Kshs 2.5M and comment on your results.
