Assignment 6- STAT
LINEAR REGRESSION
Overview: This unit deals with prediction and imperfect relationships; constructing the
least-squares regression line (regression of Y on X and regression of X on Y); measuring
prediction errors (the standard error of estimate); considerations in using linear
regression for prediction; the relation between the regression constants and Pearson r;
and multiple regression.
CHAPTER OUTLINE
I. Introduction.
A. Linear Regression. This topic deals with predicting scores of one distribution using
information known about scores on a second distribution. For example, one might
predict your height if one knew your weight and the nature of the relationship
between height and weight in a sample of other people.
B. Correlation. This refers to the magnitude and direction of the relationship between two
variables.
II. Linear Relationships
A. Linear. A linear relationship between two variables is one that can accurately be
represented by a straight line.
B. Curvilinear. When a curved line fits a set of points better than a straight line, the
relationship is called a curvilinear association or relationship.
C. Scatter plots. A scatter plot is a graph of paired X (one variable score) and Y (another
variable score) values. By visually examining the graph one can get a good idea of the
nature of the relationship between the two variables (i.e., linear or not).
III. Straight line equation.
A. General Equation. Y = bX + a, where a = the Y intercept and b = the slope of the line.
B. Slope of the straight-line equation (b). The slope tells us how much the Y score changes
for each unit change in the X score. In equation form:
$$ b = \text{slope} = \frac{\Delta Y}{\Delta X} = \frac{Y_2 - Y_1}{X_2 - X_1} $$
The slope is a constant value.
C. Y intercept (a). The Y intercept is the value of Y where the line intersects the Y axis. It is
the value of Y when X=0.
D. Relationships.
1. Positive relationships. This indicates that there is a direct relationship between the
variables. Higher values of X are associated with higher values of Y and vice versa.
2. Negative relationships. This exists when there is an inverse relationship between X
and Y. Low values of X are associated with high values of Y and vice versa.
3. Perfect relationship. This occurs when all the pairs of points fall on a straight line.
4. Imperfect relationships. This is when a positive or negative relationship exists but
not all of the points fall on the line.
IV. Least-Squares Regression Line for Prediction.
A. Least-squares criterion. In an imperfect relationship no single straight line will hit all the
points. We pick the line that minimizes the total error of prediction, i.e., we construct
the one line that minimizes $\sum (Y - Y')^2$, where Y' is the predicted value of Y for any
value of X.
V. Prediction Errors. When relationships between X and Y variables are imperfect, there will be
prediction errors.
A. Standard error of estimate ($s_{Y|X}$). Quantifying the magnitude of the error involves
computing the standard error of estimate, symbolized $s_{Y|X}$. The standard error of
estimate is much like the standard deviation.
1. Definition. Gives a measure of the average deviation of the prediction errors about the
regression line.
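2. Equation for standard error of estimate.
$$ s_{Y|X} = \sqrt{\frac{SS_Y - \dfrac{\left[\sum XY - \dfrac{(\sum X)(\sum Y)}{N}\right]^2}{SS_X}}{N - 2}} $$
where $SS_X = \sum X^2 - (\sum X)^2 / N$ and $SS_Y = \sum Y^2 - (\sum Y)^2 / N$.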
3. Interpretation. The larger the value of $s_{Y|X}$, the less confidence one has in the
prediction of Y given X. The smaller the value of $s_{Y|X}$, the more likely the prediction
will be accurate. If one constructed lines parallel to the regression line at
distances of ±1$s_{Y|X}$, ±2$s_{Y|X}$, and ±3$s_{Y|X}$, one would find about 68%, 95%, and
99% of the scores, respectively, falling between the lines.
B. Other errors. One must be careful of other sources of error in making predictions. There
are two major considerations.
1. Linearity. The original relationship needs to be linear for accurate prediction using linear
regression.
2. Prediction in the range. Generally, one uses a sample to generate the data for calculating
the regression constants ($b_Y$ and $a_Y$). Prediction of Y should be based on values of X
within the range of the sample upon which the constants are based.
CONCEPT REVIEW
It is often useful to use knowledge of one variable to predict a likely value on a second
variable. If there is a relationship between two variables, we can use knowledge of this
relationship for prediction. The topic that covers this material for linear relationships is
called linear regression. The easiest way to determine whether a relationship exists between
two variables is to plot the variables on a graph. Such a plot is called a scatter plot. A scatter
plot is a graph of paired X and Y scores. When a straight line accurately describes the
relationship between two variables, the relationship is called linear. Not all relationships are
linear. Those that are not are called curvilinear. In these cases, a curved line fits the points
better than a straight line. Although a graphic solution is sometimes used for prediction, it is
more common to predict Y from the equation of the straight line. The general form of the
equation for a straight line is:
Y = bX + a
where a = the Y intercept and b = the slope of the line. The Y intercept is the value of Y where
the line intersects the Y axis. Thus, it is the value of Y when X = 0. The slope of a line measures
its rate of change. The slope tells how much the Y score changes for each unit change in the X
score. In straight-line functions, the slope has a constant value for any points on the line. In
conceptual terms the equation for the slope is:
$$ b = \frac{\Delta Y}{\Delta X} = \frac{Y_2 - Y_1}{X_2 - X_1} $$
If one had the two points (10, 20) and (15, 30), the slope for the line connecting these
points would be:
$$ b = \frac{30 - 20}{15 - 10} = 2 $$
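The slope calculation above can be checked with a short Python sketch. The function names `slope_between` and `intercept_from` are illustrative choices, not names used in this unit; the intercept step simply rearranges Y = bX + a into a = Y - bX.

```python
def slope_between(p1, p2):
    """Return the slope b = (Y2 - Y1) / (X2 - X1) of the line through two (X, Y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("slope is undefined when X1 equals X2 (vertical line)")
    return (y2 - y1) / (x2 - x1)


def intercept_from(point, b):
    """Return the Y intercept a = Y - bX for a line with slope b through the given point."""
    x, y = point
    return y - b * x


# The two points used in the text: (10, 20) and (15, 30).
b = slope_between((10, 20), (15, 30))
a = intercept_from((10, 20), b)
print(b, a)  # 2.0 0.0
```

With these two points the line is Y = 2X + 0, so the Y score changes by 2 for each unit change in X.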
Relationships between two variables may be either positive or negative. If the relationship is
positive, the slope is positive. If the relationship is negative, the slope is negative. In a positive
relationship higher values of X are associated with higher values of Y. In a negative relationship
lower values of X are associated with higher values of Y. On a graph, a negative slope runs
downward from left to right. In a negative relationship, as X increases, Y decreases. In a positive
relationship, as X increases, Y increases.
In an imperfect relationship, not all the points fall on the regression line. In an imperfect
relationship one constructs the line that minimizes errors of prediction according to the
least-squares criterion. This is called the least-squares regression line. The vertical distance
between the regression line and each point represents the error in prediction. Y' equals the
predicted Y value and Y equals the actual value of Y, so Y - Y' equals the error for each point.
The least-squares regression line minimizes $\sum (Y - Y')^2$.
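The terms $b_Y$ and $a_Y$ are called regression constants. The regression line for predicting Y
given X, $Y' = b_Y X + a_Y$, is constructed by computing values for $b_Y$ and $a_Y$.
The computational formula for computing $b_Y$ is:
$$ b_Y = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum X^2 - \dfrac{(\sum X)^2}{N}} $$
N is equal to the number of paired scores. The regression constant $a_Y$ is given by the equation:
$$ a_Y = \bar{Y} - b_Y \bar{X} $$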
Since we need to know the value of $b_Y$ to determine the constant $a_Y$, we first find $b_Y$,
then $a_Y$. Once they are both found they are substituted into the regression equation. The
above regression constants are for the regression line of Y on X. It is sometimes of interest to
predict X given Y. This is called the regression line of X on Y. The linear regression equation for
predicting X given Y is:
$$ X' = b_X Y + a_X $$
This regression line is constructed by calculating values for $b_X$ and $a_X$. The computational
formula for $b_X$ is:
$$ b_X = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum Y^2 - \dfrac{(\sum Y)^2}{N}} $$
The regression line of Y on X will equal the regression line of X on Y only when the relationship
is perfect.
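The computational formulas for the regression constants can be sketched in Python as follows. The function name `regression_constants` and both small data sets are made up for illustration and do not come from this unit. The intercept for the X-on-Y line is computed as the mean of X minus b_X times the mean of Y, by analogy with the formula for a_Y; treat that as an assumption, since this unit does not spell it out.

```python
def regression_constants(xs, ys):
    """Return (b_Y, a_Y, b_X, a_X) from paired scores using the raw-sum formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired scores")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    sp = sum_xy - sum_x * sum_y / n    # numerator shared by b_Y and b_X
    ss_x = sum_x2 - sum_x ** 2 / n     # denominator of b_Y
    ss_y = sum_y2 - sum_y ** 2 / n     # denominator of b_X
    if ss_x == 0 or ss_y == 0:
        raise ValueError("X and Y must each take at least two different values")

    mean_x, mean_y = sum_x / n, sum_y / n
    b_y = sp / ss_x
    a_y = mean_y - b_y * mean_x        # a_Y = mean(Y) - b_Y * mean(X)
    b_x = sp / ss_y
    a_x = mean_x - b_x * mean_y        # assumed analogue of the a_Y formula
    return b_y, a_y, b_x, a_x


# Illustrative perfect relationship (every point lies on Y = 2X + 1).
print(regression_constants([1, 2, 3, 4], [3, 5, 7, 9]))  # (2.0, 1.0, 0.5, -0.5)

# Illustrative imperfect relationship: the two regression lines now differ.
b_y, a_y, b_x, a_x = regression_constants([1, 2, 3, 4], [2, 4, 5, 9])
print(b_y * b_x < 1)  # True
```

In the perfect case, X' = 0.5Y - 0.5 is just Y' = 2X + 1 solved for X, so the two regression lines coincide. In the imperfect case the product of the two slopes falls below 1, so the lines are different.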
The standard error of estimate quantifies the prediction errors. For predicting Y given X it is:
$$ s_{Y|X} = \sqrt{\frac{SS_Y - \dfrac{\left[\sum XY - \dfrac{(\sum X)(\sum Y)}{N}\right]^2}{SS_X}}{N - 2}} $$
where $SS_X = \sum X^2 - (\sum X)^2 / N$ and $SS_Y = \sum Y^2 - (\sum Y)^2 / N$.
The standard error of estimate is computed over all Y scores.
For it to be meaningful one assumes that the variability of Y remains constant as one
goes from one X score to the next. This assumption is called the assumption of
homoscedasticity. In general one would expect about 68% of the points to fall within
±1$s_{Y|X}$ of the regression line, about 95% to fall within ±2$s_{Y|X}$, and about 99%
to fall within ±3$s_{Y|X}$.
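As a rough illustration of the standard error of estimate for predicting Y given X, the sketch below applies the computational formula (SS_Y minus the squared cross-product term over SS_X, divided by N - 2, then square-rooted). The function name and the two small data sets are invented for the example.

```python
import math


def standard_error_of_estimate(xs, ys):
    """Return s_{Y|X} for predicting Y from X, using the computational formula."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired scores so that N - 2 is positive")
    sum_x, sum_y = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    sp = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if ss_x == 0:
        raise ValueError("X must take at least two different values")

    residual_ss = ss_y - sp ** 2 / ss_x
    residual_ss = max(residual_ss, 0.0)  # guard against tiny negative rounding error
    return math.sqrt(residual_ss / (n - 2))


# Illustrative perfect relationship: no prediction error at all.
print(standard_error_of_estimate([1, 2, 3, 4], [3, 5, 7, 9]))  # 0.0

# Illustrative imperfect relationship: some error remains.
print(standard_error_of_estimate([1, 2, 3, 4], [2, 4, 5, 9]) > 0)  # True
```

The value can never be negative, because it is the square root of a sum of squared errors divided by N - 2.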
In general it is appropriate to use linear regression to predict values only if the
relationship is linear. It is also important that the group used to compute the regression
constants be representative of the prediction group. In other words, the data collected to
compute the regression constants should be a random sample from the population of interest.
Finally, the linear regression equation is properly used only within the range of the variable
upon which it is based. This is because we do not know whether the relationship remains
linear outside the range of our sample.
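One way to keep the range restriction in view when predicting is to report whether the X value lies inside the range of the original sample. The sketch below is only an illustration: the function name `predict_y`, the constants, and the sample range are all hypothetical.

```python
def predict_y(x, b_y, a_y, x_min, x_max):
    """Return (Y', in_range) where Y' = b_Y * X + a_Y.

    in_range is False when X lies outside the sample range [x_min, x_max],
    which signals that the prediction should be treated with caution.
    """
    y_pred = b_y * x + a_y
    in_range = x_min <= x <= x_max
    return y_pred, in_range


# Hypothetical constants b_Y = 2.0 and a_Y = 1.0 from a sample with X between 1 and 4.
print(predict_y(3, 2.0, 1.0, 1, 4))   # (7.0, True)
print(predict_y(10, 2.0, 1.0, 1, 4))  # (21.0, False)
```

Returning a flag rather than refusing to predict is a design choice for the sketch; the second prediction is still computed, but it extrapolates beyond the data on which the line was based.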
EXERCISES
1. X represents aptitude test scores and Y represents grade point average in college. If the
least-squares regression line for the relationship between these two variables is
Y' = .005X + 1.2, what GPA would you predict for people who scored each of the following
scores on the aptitude test?
a. 159
b. 300
c. 500
d. 550
2. Draw a graph of aptitude test score versus grade point average and construct the regression
line Y' = .005X + 1.2.
3. A professor wanted to predict final exam scores from midterm exam scores. He used data
from several different professors teaching the same class. He obtained the following data:
Midterm Scores (X): 83, 62, 72, 85, 85
Final Exam Scores (Y): 89, 58, 70, 92, 84
a. $\sum X$ = 387
b. $\sum X^2$ = 30,367
c. N = 5
d. $\sum Y$ = 393
e. $\sum Y^2$ = 31,705
f. $(\sum X)^2$ = 149,769
g. $(\sum Y)^2$ = 154,449
h. $\sum XY$
i. $b_Y$
j. $a_Y$
k. If the professor's class score on the midterm was 77.4, what score would you predict
the class would receive on the final exam?
4. A hospital administrator wanted to predict the number of patients her hospital would admit
in 1990. The following data were obtained from past records:
Year: 1960, 1965, 1970, 1975, 1980
Number of Admissions: 812, 983, 1127, 904, 1768
a. What would the best prediction be for the number of admissions expected in 1990?
b. What serious caution should the administrator be aware of when making her prediction?
5. A psychologist wanted to use a locus of control test to predict scores on a depression scale.
The following data were summarized for the relationship between the locus of control and
depression scale:
X 2 4 8 14 20 23 25
Y 2 6 14 20 12 9 7
X 21 29 33 40 50
Y 34 36 42 45 58
X 9 15 25 27 42 50 30
Y 14 11 5 5 0 -8 1
TRUE-FALSE QUESTIONS
1. The easiest way to determine if a relationship is linear is to calculate the regression line.
3. In a perfect linear relationship all the points must fall on the straight line.
5. In a straight line the slope approaches zero as the line comes near the point X.
6. In an inverse relationship as one variable gets larger the other variable gets smaller.
10. Generally, one can use the same regression equation for predicting Y given X as for X given
Y.
11. If the relationship between two variables is perfect the standard error of estimate equals 0.
12. If the standard error of estimate for relationship 1 equals 5.26 and for relationship 2 it
equals 8.01 then we can reasonably infer that relationship 2 is less perfect than relationship
1.
13. It is impossible to have a negative value for the standard error of estimate.
14. In general one is less confident in predictions of Y when the value of X used for the
prediction is outside the range of the original data used to construct the regression line.
15. If the regression line is parallel to the X axis then the slope of the regression line equals 0.
16. The regression line will always go through the point ($\bar{X}$, $\bar{Y}$).
MULTIPLE CHOICE.
4. In a particular relationship N = 80. How many points would you expect on the average to find
within ± 1 sY│X of the regression line?
a. 40
b. 80
c. 54
d. 0
5. What would you predict for the value of Y for the point where the value of X is $\bar{X}$?
a. cannot be determined from information given
b. 0
c. 1
d. $\bar{Y}$
6. If the value of sY│X = 4.00 for relationship A and sY│X = 5.25 for relationship B, in which
relationship would you have most confidence in a particular prediction?
a. A
b. B
c. it makes no difference
d. cannot be determined from information given
10. If the value for aY is negative, the relationship between X and Y is ____________.
a. positive
b. negative
c. inverse
d. cannot be determined from information given
13. The points (0,5) and (5,10) fall on the regression line for a perfect positive linear
relationship. What is the regression equation for this relationship?
a. Y’ = X + 5
b. Y’ = 5X
c. Y’ = 5X + 10
d. cannot be determined from information given
14. For the following points what would you predict to be the value of Y’ when X = 19? Assume
a linear relationship.
X 6 12 30 40
Y 10 14 20 27
a. 16.35
b. 24.69
c. 22.00
d. 17.75
15. If N = 8, $\sum X$ = 160, $\sum X^2$ = 4656, $\sum Y$ = 79, $\sum Y^2$ = 1309, and $\sum XY$ = 2430, what is the value of $b_Y$?
a. .9217
b. -1.8010
c. .5838
d. .7922
16. What is the slope for the points X1 = 30, Y1 = 50 and X2 = 25 and Y2 = 40?
a. 2.00
b. .50
c. -2.00
d. -.50
17. If the regression equation for a set of data is Y’ = 2.650X + 11.250 then the value of Y’ for X =
33 is __________.
a. 87.45
b. 371.25
c. 98.70
d. 76.20
18. If $\bar{X}$ = 57.2, $\bar{Y}$ = 84.6, and $b_Y$ = .37, the value of $a_Y$ = __________
a. 141.80
b. -25.90
c. 63.44
d. 27.40
19. If the regression line for predicting X given Y were X’ = 103Y + 26.2, what would the value of
X’ be if Y = 0.2?
a. 129.2
b. 25.8
c. 5.2
d. 46.8