
Lesson 12 - Introduction To Regression and Correlation Analysis Regression Analysis

This document provides an introduction to regression and correlation analysis. It defines a simple linear regression model that predicts a dependent variable (Y) based on an independent variable (X). It describes the regression parameters (intercept β0 and slope β1), assumptions, and properties of the regression line. Formulas are given for estimating the parameters β0, β1, and error variance σ2 from sample data using the least squares method. Test statistics are described for testing hypotheses about the parameters. Correlation analysis is introduced as a way to visualize the strength of relationships between variables using scatter plots.

Uploaded by

Fe Gregorio

LESSON 12 - INTRODUCTION TO REGRESSION AND CORRELATION ANALYSIS

Regression Analysis

We begin our discussion of regression analysis by considering inferences about
the regression parameters for the simple linear regression model. The model can be
stated as follows:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (2.1)$$

where:

$Y_i$ is the value of the response variable in the ith trial
$\beta_0$ and $\beta_1$ are parameters
$X_i$ is a known constant, namely, the value of the independent variable in the ith
trial
$\varepsilon_i$ is a random error term with mean $E(\varepsilon_i) = 0$ and variance
$\sigma^2(\varepsilon_i) = \sigma^2$; $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, so that
the covariance $\sigma(\varepsilon_i, \varepsilon_j) = 0$ for all $i, j$ with $i \ne j$;
$i = 1, 2, 3, \ldots, n$

Regression model (2.1) is said to be simple, linear in the parameters, and linear
in the independent variable. It is “simple” in that there is only one independent variable,
“linear in the parameters” because no parameter appears as an exponent or is multiplied
or divided by another parameter, and “linear in the independent variable” because this
variable appears only in the first power. A model that is linear in the parameters and the
independent variable is also called a first-order model.

In regression analysis, a mathematical equation is used to predict the value of
the dependent variable (denoted Y) on the basis of the independent variable (denoted X),
as in (2.1).

The term $\beta_0$ is called the y-intercept. It refers to the expected level of Y when
X = 0. In the running example used here, Y is a defendant's sentence length and X is the
number of prior convictions, so X = 0 means no priors. This is the baseline amount because
it is what Y is before we take the level of X into account.

The term $\beta_1$ is called the slope (or the regression coefficient) for X. It
represents the amount that Y changes (increases or decreases) for each change of one
unit in X. Thus, for example, the expected difference in sentence length between a
defendant with X = 0 (no priors) and one with X = 1 (one prior) is $\beta_1$, and the
difference between a defendant with X = 0 and one with X = 2 (two priors) is $2\beta_1$.

Finally, $\varepsilon_i$ is called the error term or disturbance term. It represents the
amount of the sentence that cannot be accounted for by $\beta_0$ and $\beta_1 X$. In other
words, $\varepsilon_i$ represents the departure of a given defendant's sentence from that
which would be expected on the basis of his number of priors (X).

Let X and Y be taken from a bivariate distribution where the Y variable (dependent) is
assumed to be normal and to have common variance. For each value of Y, a
corresponding value of X is also obtained and assumed to be a fixed or
independent variate. In general, the assumptions underlying the regression analysis
are:

1. The variable X is fixed or a predetermined variable.
2. The Y values are normally distributed and have the same variance.
3. The errors ($e_i$) associated with each value of Y are also normal, with mean zero and
variance $\sigma^2$.

Requirements for Regression

The assumptions underlying regression are:

a. Both variables are assumed to be measured at the interval scale.
b. Regression assumes a straight-line relationship. If this is not the case,
there are various transformations that can be used to make the relationship
into a straight line. Also, if extremely deviant cases are observed in a scatter
plot, these should be removed from the analysis.
c. Sample members must be chosen randomly in order to employ tests of
significance.
d. To test the significance of the regression line, one must also assume
normality for both variables or else have a large sample.

Properties of Fitted Regression Line

To find “good” estimators of the regression parameters $\beta_0$ and $\beta_1$, we shall
employ the method of least squares. For each sample observation $(X_i, Y_i)$, the method
of least squares considers the deviation of $Y_i$ from its expected value:

$$Y_i - (\beta_0 + \beta_1 X_i) \qquad (2.2)$$

The regression line fitted by the method of least squares has a number of
properties worth noting. Here $e_i = Y_i - \hat{Y}_i$ denotes the residual in the ith trial.
1. The sum of the residuals is zero: $\sum e_i = 0$
2. The sum of the squared residuals, $\sum e_i^2$, is a minimum. This was the
requirement to be satisfied in deriving the least squares estimators of the
regression parameters.
3. The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$:
$\sum Y_i = \sum \hat{Y}_i$
4. The sum of the weighted residuals is zero when the residual in the ith trial is
weighted by the level of the independent variable in the ith trial: $\sum X_i e_i = 0$
5. The sum of the weighted residuals is zero when the residual in the ith trial is
weighted by the fitted value of the response variable for the ith trial:
$\sum \hat{Y}_i e_i = 0$
6. The regression line always goes through the point $(\bar{X}, \bar{Y})$
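
These properties can be checked numerically. The following is a minimal Python sketch, assuming a small made-up data set (the values of `x` and `y` are hypothetical and are not taken from the lesson). It fits the line by least squares and evaluates the sums named in properties 1, 3, 4, 5, and 6.

```python
# Hypothetical data, used only to illustrate the properties of the fitted line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ssx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / ssx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

sum_resid = sum(resid)                                          # property 1
sum_fitted = sum(fitted)                                        # property 3
sum_x_resid = sum(xi * ei for xi, ei in zip(x, resid))          # property 4
sum_fit_resid = sum(fi * ei for fi, ei in zip(fitted, resid))   # property 5
line_at_x_bar = b0 + b1 * x_bar                                 # property 6

print(sum_resid, sum_x_resid, sum_fit_resid)
```

Each printed sum should be zero up to floating-point rounding, and `line_at_x_bar` should equal `y_bar`.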
Estimation of the Parameters $\beta_0$, $\beta_1$, and $\sigma^2$

The values of the parameters in the regression equation are often not
known. The common practice is to obtain sample paired observations (X, Y) and then
estimate the parameters using the following formulas:

$$b_1 = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \qquad \text{(estimator of } \beta_1\text{)}$$

$$b_0 = \frac{1}{n}\left(\sum Y_i - b_1 \sum X_i\right) = \bar{Y} - b_1 \bar{X} \qquad \text{(estimator of } \beta_0\text{)}$$

$$S^2_{y \cdot x} = MSE = \frac{SSE}{n-2} = \frac{\sum Y_i^2 - \frac{(\sum Y_i)^2}{n} - b_1\left[\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}\right]}{n-2} \qquad \text{(estimator of } \sigma^2\text{)}$$
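
As an illustration, the three estimators can be written directly from the sums that appear in the formulas. This is a minimal Python sketch; the function name `least_squares` and the sample data are assumptions made for the example, not part of the lesson.

```python
def least_squares(x, y):
    """Return (b0, b1, mse) for a simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))

    sxy = sum_xy - sum_x * sum_y / n   # numerator of b1
    ssx = sum_xx - sum_x ** 2 / n      # denominator of b1 (SSX)
    syy = sum_yy - sum_y ** 2 / n

    b1 = sxy / ssx
    b0 = (sum_y - b1 * sum_x) / n
    mse = (syy - b1 * sxy) / (n - 2)   # SSE / (n - 2)
    return b0, b1, mse


# Hypothetical data, not from the lesson.
b0, b1, mse = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(b0, b1, mse)
```

For these five points the sums are small enough to verify by hand: SSX = 10, the corrected cross-product sum is 6, so the slope is 0.6 and the intercept is 2.2.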

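To test a hypothesis about the intercept, such as $H_0: \beta_0 = 0$, the following test
statistic may be computed:

$$t = \frac{b_0 - \beta_0}{SE(b_0)}$$

where $\beta_0$ is the value specified under $H_0$ and

$$SE(b_0) = \sqrt{MSE\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}\right]} = \sqrt{\frac{S^2_{y \cdot x} \sum X_i^2}{n \, SSX}}, \qquad SSX = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$$

Decision Rule: Reject $H_0$ if $|t_c| > t_{\alpha/2,\,(n-2)}$

To test a hypothesis about the slope, such as $H_0: \beta_1 = 0$, the test statistic is

$$t = \frac{b_1 - \beta_1}{SE(b_1)}$$

where $\beta_1$ is the value specified under $H_0$ and

$$SE(b_1) = \sqrt{\frac{S^2_{y \cdot x}}{SSX}}$$

Decision Rule: Reject $H_0$ if $|t_c| > t_{\alpha/2,\,(n-2)}$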
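
The two test statistics can be sketched in Python as follows. The function name `t_statistics`, its parameters, and the data are assumptions for illustration only; the critical value is hard-coded from a standard t table rather than computed.

```python
import math


def t_statistics(x, y, beta0_null=0.0, beta1_null=0.0):
    """Return (t for the intercept, t for the slope) under the given null values."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)

    se_b0 = math.sqrt(mse * sum(xi * xi for xi in x) / (n * ssx))
    se_b1 = math.sqrt(mse / ssx)
    return (b0 - beta0_null) / se_b0, (b1 - beta1_null) / se_b1


# Hypothetical data, not from the lesson (n = 5, so n - 2 = 3 degrees of freedom).
t_b0, t_b1 = t_statistics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

T_CRIT = 3.182  # t(0.025, 3) taken from a t table
reject_slope_null = abs(t_b1) > T_CRIT
print(t_b0, t_b1, reject_slope_null)
```

With only three degrees of freedom the critical value is large, so even a visibly positive slope is not declared significant in this toy case.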


Correlation Analysis

Correlations vary with respect to their strength. We can visualize
differences in the strength of correlation by means of a scatter plot or scatter diagram, a
graph that shows the way scores on any two variables, X and Y, are scattered
throughout the range of possible score values.

Correlation coefficients generally range between –1.00 and +1.00 as
follows:

RANGE     DESCRIPTIVE EQUIVALENT
-1.00     Perfect negative correlation
-0.60     Strong negative correlation
-0.30     Moderate negative correlation
-0.10     Weak negative correlation
 0.00     No correlation
+0.10     Weak positive correlation
+0.30     Moderate positive correlation
+0.60     Strong positive correlation
+1.00     Perfect positive correlation

Pearson’s Correlation Coefficient

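$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{\sqrt{\left(\sum X^2 - n\bar{X}^2\right)\left(\sum Y^2 - n\bar{Y}^2\right)}} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$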

Requirements for the Use of Pearson’s Correlation Coefficient
1. A straight-line relationship. Pearson’s r is only useful for detecting a straight-
line correlation between X and Y.

2. Interval Data. Both X and Y variables must be measured at the interval level,
so that scores may be assigned to the respondents.

3. Random sampling. Sample members must have been drawn at random from
a specified population to apply a test of significance.

4. Normally distributed characteristics. Testing the significance of Pearson’s r


requires both X and Y variables to be normally distributed in the population.
In small samples, failure to meet the requirement of normally distributed
characteristics may seriously impair the validity of the test. However, this
requirement is of minor importance when the sample size equals or exceeds
30 cases.

A test of the special hypothesis $\rho = 0$ versus an appropriate alternative is
equivalent to testing $\beta_1 = 0$ for the simple linear regression model, and therefore either
the t-test with n – 2 degrees of freedom or the F-test with 1 and n – 2 degrees of freedom
is applicable. However, if one wishes to avoid the analysis of variance procedure and
compute only the sample correlation coefficient, it can be easily verified that the t-value
is given by

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
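
A short Python sketch of Pearson's r and the associated t-value is given here. The helper names `pearson_r` and `t_from_r` and the data are hypothetical, chosen only to show the computation.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the raw-score (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    num = n * sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y
    den = math.sqrt(
        (n * sum(xi * xi for xi in x) - sum_x ** 2)
        * (n * sum(yi * yi for yi in y) - sum_y ** 2)
    )
    return num / den


def t_from_r(r, n):
    """t-value with n - 2 degrees of freedom for testing rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical data, not from the lesson.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
t = t_from_r(r, len(x))
print(r, t)
```

The t-value obtained this way coincides with the t-value for testing that the slope is zero on the same data, which is the equivalence stated above.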

SEATWORK NO. 2

The following data give infant mortality rates (deaths 1 year or less after birth, per
1000 live births) for a period of 11 years in a certain country.

Year (X)                    1     2     3     4     5     6     7     8     9     10    11
Infant Mortality Rate (Y)   68.5  65.8  65.2  65.5  64.1  59.3  62.0  67.9  64.7  58.7  53.3

a. Determine the slope of the line if the data are fitted to a straight line by the least
squares method. What does the slope indicate?
b. Compute the coefficient of determination.
c. Assuming that the factors affecting the mortality rate remain constant, estimate the
infant mortality rate of the country 20 years after the last recorded year.

COMPUTATIONS:

X       Y        X²     Y²         XY
1       68.50    1      4692.3     68.5
2       65.80    4      4329.6     131.6
3       65.20    9      4251       195.6
4       65.50    16     4290.3     262
5       64.10    25     4108.8     320.5
6       59.30    36     3516.5     355.8
7       62.00    49     3844       434
8       67.90    64     4610.4     543.2
9       64.70    81     4186.1     582.3
10      58.70    100    3445.7     587
11      53.30    121    2840.9     586.3

Total:
66      695      506    44115.56   4066.8

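a. Slope of the line, $b_1$

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{11(4066.8) - (66)(695)}{11(506) - (66)^2} = \frac{-1135.2}{1210}$$

$b_1 = -0.9382$

The negative slope indicates that the infant mortality rate fell, on average, by about
0.94 deaths per 1000 live births per year.

$$b_0 = \bar{Y} - b_1\bar{X} = \frac{695}{11} - (-0.9382)\left(\frac{66}{11}\right)$$

$b_0 = 68.811$

$\hat{Y} = b_0 + b_1 X$
$\hat{Y} = 68.811 - 0.9382X$

b. Coefficient of determination, $r^2$

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{11(4066.8) - (66)(695)}{\sqrt{\left[11(506) - (66)^2\right]\left[11(44115.56) - (695)^2\right]}}$$

$r = -0.689$
$r^2 = (-0.689)^2 = 0.474$, or 47.4%

c. Twenty years after the last recorded year (X = 11) corresponds to X = 31, so
$\hat{Y} = 68.811 - 0.9382(31) = 39.73$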
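
The seatwork figures can be cross-checked with a few lines of Python. This is only a sketch that applies the formulas above to the eleven data pairs from the problem; the variable names are illustrative.

```python
import math

# Data from the seatwork problem.
year = list(range(1, 12))
rate = [68.5, 65.8, 65.2, 65.5, 64.1, 59.3, 62.0, 67.9, 64.7, 58.7, 53.3]

n = len(year)
sum_x, sum_y = sum(year), sum(rate)
sum_xx = sum(x * x for x in year)
sum_yy = sum(y * y for y in rate)
sum_xy = sum(x * y for x, y in zip(year, rate))

numerator = n * sum_xy - sum_x * sum_y
b1 = numerator / (n * sum_xx - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

r = numerator / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
r_squared = r * r

# Twenty years after the last recorded year (X = 11) is X = 31.
predicted = b0 + b1 * 31

print(round(b1, 4), round(b0, 3), round(r, 3), round(r_squared, 3), round(predicted, 2))
```

Because X = 31 lies far outside the observed range of 1 to 11, the prediction is an extrapolation and rests entirely on the stated assumption that conditions remain constant.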

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.

Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
1 , 68.50 M+
2 , 65.80 M+
3 , 65.20 M+
4 , 65.50 M+
5 , 64.10 M+
6 , 59.30 M+
7 , 62.00 M+
8 , 67.90 M+
9 , 64.70 M+
10 , 58.70 M+
11 , 53.30 M+

To obtain the different statistics PRESS

Shift 1 1 =       506.             ΣX²
Shift 1 2 =       66.              ΣX
Shift 1 3 =       11.              n
Shift 2 1 =       6.               X̄
Shift 2 2 =       3.16227766       Xσn
Shift 2 3 =       3.31662479       Xσn-1
Shift 1 ► 1 =     44,115.56        ΣY²
Shift 1 ► 2 =     695.             ΣY
Shift 1 ► 3 =     4,066.8          ΣXY
Shift 2 ► 1 =     63.1818181818    Ȳ
Shift 2 ► 2 =     4.308515497      Yσn
Shift 2 ► 3 =     4.518809175      Yσn-1
Shift 2 ► ► 3 =   -0.688587845     r

Compare these values with those obtained in the summary table above.

WORKSHEET No. 2
INTRODUCTION TO REGRESSION AND CORRELATION ANALYSIS
1. The Statistics Consulting Center at Virginia Polytechnic Institute and State
University analyzed data on normal woodchucks for the Department of Veterinary
Medicine. The variables of interest were body weight in grams and heart weight in
grams. It was also of interest to develop a linear regression equation in order to
determine if there is a significant linear relationship between heart weight and total
body weight. Use heart weight as the independent variable and the body weight as
the dependent variable and fit a simple linear regression using the following data. In
addition, test the hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \ne 0$.
Summary Table (Computations)
Body Weight (grams) Y   Heart Weight (grams) X   Y²   X²   XY
4050 11.2 16402500 125.44 45360.00
2645 12.4 6996025 153.76 32798.00
3120 10.5 9734400 110.25 32760.00

5700 13.2 32490000 174.24 75240.00
2595 9.80 6734025 96.04 25431.00
3640 11.0 13249600 121.00 40040.00
2050 10.8 4202500 116.64 22140.00
4235 10.4 17935225 108.16 44044.00
2935 12.2 8614225 148.84 35807.00
4975 11.2 24750625 125.44 55720.00
3690 10.8 13616100 116.64 39852.00
2800 14.2 7840000 201.64 39760.00
2775 12.2 7700625 148.84 33855.00
2170 10.0 4708900 100.00 21700.00
2370 12.3 5616900 151.29 29151.00
2055 12.5 4223025 156.25 25687.50
2025 11.8 4100625 139.24 23895.00
2645 16.0 6996025 256.00 42320.00
2675 13.8 7155625 190.44 36915.00
59150 226.30 203066950 2740.15 702475.5

Summary:
n = 19
$\sum X = 226.3$, $\sum Y = 59,150$, $\sum X^2 = 2,740.15$, $\sum Y^2 = 203,066,950$, $\sum XY = 702,475.5$

$\bar{X} = \frac{226.3}{19} = 11.911$, $\bar{Y} = \frac{59,150}{19} = 3,113.158$

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{19(702,475.5) - (226.3)(59,150)}{19(2,740.15) - (226.3)^2} = \frac{-38,610.5}{851.16} = -45.362$$

$$b_0 = \bar{Y} - b_1\bar{X} = 3,113.158 - (-45.362)(11.911) = 3,653.465$$

Regression Equation: $\hat{Y} = b_0 + b_1 X = 3,653.465 - 45.362X$

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{19(702,475.5) - (226.3)(59,150)}{\sqrt{\left[19(2,740.15) - (226.3)^2\right]\left[19(203,066,950) - (59,150)^2\right]}} = \frac{-38,610.5}{\sqrt{306,034,194,978}} = -0.07$$

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
11.2 , 4050 M+
12.4 , 2645 M+
10.5 , 3120 M+
13.2 , 5700 M+
9.80 , 2595 M+
11.0 , 3640 M+
10.8 , 2050 M+
10.4 , 4235 M+
12.2 , 2935 M+
11.2 , 4975 M+
10.8 , 3690 M+
14.2 , 2800 M+
12.2 , 2775 M+
10.0 , 2170 M+
12.3 , 2370 M+

12.5 , 2055 M+
11.8 , 2025 M+
16.0 , 2645 M+
13.8 , 2675 M+

To obtain the different statistics PRESS

Shift 1 1 =       2,740.15         ΣX²
Shift 1 2 =       226.3            ΣX
Shift 1 3 =       19               n
Shift 2 1 =       11.91052632      X̄
Shift 1 ► 1 =     203,066,950      ΣY²
Shift 1 ► 2 =     59,150.          ΣY
Shift 1 ► 3 =     702,475.5        ΣXY
Shift 2 ► 1 =     3,113.157895     Ȳ
Shift 2 ► ► 1 =   3,653.445709     A
Shift 2 ► ► 2 =   -45.36221157     B
Shift 2 ► ► 3 =   -0.069794379     r

$\hat{Y} = A + BX = 3,653.45 - 45.36X$
where: b0 = A and b1 = B

a) $H_0: \beta_1 = 0$ (The slope of the regression line is equal to zero)
   $H_a: \beta_1 \ne 0$ (The slope of the regression line is not equal to zero)
b) Critical region and level of significance: Let $\alpha = 0.05$ be the level of
significance; thus $t_{\alpha/2,\,(n-2)} = t_{0.025,\,17} = 2.110$.

c) Computation:

$$S^2_{y \cdot x} = \frac{\sum Y^2 - \frac{(\sum Y)^2}{n} - b_1\left[\sum XY - \frac{(\sum X)(\sum Y)}{n}\right]}{n-2}$$

$$= \frac{203,066,950 - \frac{(59,150)^2}{19} - (-45.362)\left[702,475.5 - \frac{(226.3)(59,150)}{19}\right]}{19-2} = \frac{18,831,479}{17} = 1,107,734$$

$$SSX = \sum X^2 - \frac{(\sum X)^2}{n} = 2,740.15 - \frac{(226.3)^2}{19} = 44.798$$

$$SE(b_1) = \sqrt{\frac{S^2_{y \cdot x}}{SSX}} = \sqrt{\frac{1,107,734}{44.798}} = \sqrt{24,727.3} = 157.25$$

$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{-45.362 - 0}{157.25} = -0.288$$

d) Decision: The computed t-value, -0.288, lies between the critical values -2.110 and
+2.110, that is, in the acceptance region. Thus, the null hypothesis is not rejected.

e) Conclusion: Therefore, the slope of the regression line is not significantly
different from zero. That is, the regression equation cannot reliably be
used to estimate the body weight of an individual if the heart weight
is given.
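
As a cross-check, the slope test for the woodchuck data can be recomputed from the raw observations. The following Python sketch uses the 19 data pairs from the summary table; the variable names are illustrative, and the critical value 2.110 is the tabulated $t_{0.025,\,17}$ quoted in the solution.

```python
import math

# Heart weight (X, grams) and body weight (Y, grams) from the summary table.
heart = [11.2, 12.4, 10.5, 13.2, 9.8, 11.0, 10.8, 10.4, 12.2, 11.2,
         10.8, 14.2, 12.2, 10.0, 12.3, 12.5, 11.8, 16.0, 13.8]
body = [4050, 2645, 3120, 5700, 2595, 3640, 2050, 4235, 2935, 4975,
        3690, 2800, 2775, 2170, 2370, 2055, 2025, 2645, 2675]

n = len(heart)
sum_x, sum_y = sum(heart), sum(body)

# Corrected sums of squares and cross-products.
ssx = sum(x * x for x in heart) - sum_x ** 2 / n
syy = sum(y * y for y in body) - sum_y ** 2 / n
sxy = sum(x * y for x, y in zip(heart, body)) - sum_x * sum_y / n

b1 = sxy / ssx
mse = (syy - b1 * sxy) / (n - 2)   # S^2_{y.x}
se_b1 = math.sqrt(mse / ssx)
t = (b1 - 0.0) / se_b1             # H0: beta1 = 0

T_CRIT = 2.110                     # t(0.025, 17) from a t table
reject_null = abs(t) > T_CRIT
print(round(b1, 3), round(se_b1, 2), round(t, 3), reject_null)
```

The same t-value follows from $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with the calculator value of r, which gives an independent check on the arithmetic.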

2. Using the data on patient satisfaction from exercise No. 4 of worksheet No. 3 (pages
250-251), determine the relationship of the response variable (level of patient’s
satisfaction) to each of the three independent variables, namely patient’s age, index of
severity of illness, and index of level of anxiety. (Use separate sheets of paper
for the complete solutions.)
X1 = patient’s age, X2 = patient’s index of severity of illness,
X3 = patient’s index of level of anxiety, Y = level of patient’s satisfaction

X1 Y X12 Y2 XY
50 48 2500 2304 2400
36 57 1296 3249 2052

40 66 1600 4356 2640
41 70 1681 4900 2870
28 89 784 7921 2492
49 36 2401 1296 1764
42 46 1764 2116 1932
45 54 2025 2916 2430
52 26 2704 676 1352
29 77 841 5929 2233
29 89 841 7921 2581
43 67 1849 4489 2881
38 47 1444 2209 1786
34 51 1156 2601 1734
53 57 2809 3249 3021
36 66 1296 4356 2376
33 79 1089 6241 2607
29 88 841 7744 2552
33 60 1089 3600 1980
55 49 3025 2401 2695
29 77 841 5929 2233
44 52 1936 2704 2288
43 60 1849 3600 2580
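911 1411 37661 92707 53479

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{23(53,479) - (911)(1,411)}{\sqrt{\left[23(37,661) - (911)^2\right]\left[23(92,707) - (1,411)^2\right]}} = \frac{-55,404}{\sqrt{5,128,097,880}} = -0.774$$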

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.

Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
50 , 48 M+
36 , 57 M+
40 , 66 M+
41 , 70 M+
28 , 89 M+
49 , 36 M+
42 , 46 M+
45 , 54 M+
52 , 26 M+
29 , 77 M+
29 , 89 M+
43 , 67 M+
38 , 47 M+
34 , 51 M+
53 , 57 M+
36 , 66 M+
33 , 79 M+
29 , 88 M+
33 , 60 M+
55 , 49 M+
29 , 77 M+
44 , 52 M+
43 , 60 M+

To obtain the different statistics PRESS

Shift 1 1 =       37,661.          ΣX²
Shift 1 2 =       911.             ΣX
Shift 1 3 =       23               n
Shift 2 1 =       39.60869565      X̄
Shift 1 ► 1 =     92,707.          ΣY²
Shift 1 ► 2 =     1,411.           ΣY
Shift 1 ► 3 =     53,479.          ΣXY
Shift 2 ► ► 3 =   -0.773682845     r

The level of patient’s satisfaction is negatively correlated with the patient’s age.
With r = -0.774, the association is strong on the scale given earlier.

X2 Y X22 Y2 XY
51 48 2601 2304 2448
46 57 2116 3249 2622
48 66 2304 4356 3168
44 70 1936 4900 3080
43 89 1849 7921 3827
54 36 2916 1296 1944
50 46 2500 2116 2300
48 54 2304 2916 2592
62 26 3844 676 1612
50 77 2500 5929 3850
48 89 2304 7921 4272
53 67 2809 4489 3551
55 47 3025 2209 2585
51 51 2601 2601 2601
54 57 2916 3249 3078
49 66 2401 4356 3234
56 79 3136 6241 4424
46 88 2116 7744 4048
49 60 2401 3600 2940
51 49 2601 2401 2499
52 77 2704 5929 4004
58 52 3364 2704 3016
50 60 2500 3600 3000
1168 1411 59748 92707 70695

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{23(70,695) - (1,168)(1,411)}{\sqrt{\left[23(59,748) - (1,168)^2\right]\left[23(92,707) - (1,411)^2\right]}} = \frac{-22,063}{37,557.59843} = -0.587$$

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
51 , 48 M+
46 , 57 M+
48 , 66 M+
44 , 70 M+
43 , 89 M+
54 , 36 M+
50 , 46 M+
48 , 54 M+
62 , 26 M+
50 , 77 M+
48 , 89 M+
53 , 67 M+
55 , 47 M+
51 , 51 M+
54 , 57 M+
49 , 66 M+

56 , 79 M+
46 , 88 M+
49 , 60 M+
51 , 49 M+
52 , 77 M+
58 , 52 M+
50 , 60 M+

To obtain the different statistics PRESS

Shift 1 1 =       59,748.          ΣX²
Shift 1 2 =       1,168.           ΣX
Shift 1 3 =       23               n
Shift 2 1 =       50.7826087       X̄
Shift 1 ► 1 =     92,707.          ΣY²
Shift 1 ► 2 =     1,411.           ΣY
Shift 1 ► 3 =     70,695.          ΣXY
Shift 2 ► ► 3 =   -0.587444376     r

The degree of association between the patient’s index of severity of illness and the
patient’s level of satisfaction is moderate (r = -0.587). The two variables are negatively
correlated.

X3 Y X32 Y2 XY
2.3 48 5.29 2304 110.4
2.3 57 5.29 3249 131.1
2.2 66 4.84 4356 145.2
1.8 70 3.24 4900 126
1.8 89 3.24 7921 160.2
2.9 36 8.41 1296 104.4
2.2 46 4.84 2116 101.2
2.4 54 5.76 2916 129.6
2.9 26 8.41 676 75.4
2.1 77 4.41 5929 161.7
2.4 89 5.76 7921 213.6

2.4 67 5.76 4489 160.8
2.2 47 4.84 2209 103.4
2.3 51 5.29 2601 117.3
2.2 57 4.84 3249 125.4
2 66 4 4356 132
2.5 79 6.25 6241 197.5
1.9 88 3.61 7744 167.2
2.1 60 4.41 3600 126
2.4 49 5.76 2401 117.6
2.3 77 5.29 5929 177.1
2.9 52 8.41 2704 150.8
2.3 60 5.29 3600 138
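52.8 1411 123.24 92707 3171.9

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{23(3,171.9) - (52.8)(1,411)}{\sqrt{\left[23(123.24) - (52.8)^2\right]\left[23(92,707) - (1,411)^2\right]}} = \frac{-1,547.1}{\sqrt{6,597,751.2}} = -0.60$$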

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
2.3 , 48 M+
2.3 , 57 M+
2.2 , 66 M+
1.8 , 70 M+

1.8 , 89 M+
2.9 , 36 M+
2.2 , 46 M+
2.4 , 54 M+
2.9 , 26 M+
2.1 , 77 M+
2.4 , 89 M+
2.4 , 67 M+
2.2 , 47 M+
2.3 , 51 M+
2.2 , 57 M+
2 , 66 M+
2.5 , 79 M+
1.9 , 88 M+
2.1 , 60 M+
2.4 , 49 M+
2.3 , 77 M+
2.9 , 52 M+
2.3 , 60 M+

To obtain the different statistics PRESS

Shift 1 1 =       123.24           ΣX²
Shift 1 2 =       52.8             ΣX
Shift 1 3 =       23               n
Shift 2 1 =       2.295652174      X̄
Shift 1 ► 1 =     92,707.          ΣY²
Shift 1 ► 2 =     1,411.           ΣY
Shift 1 ► 3 =     3,171.9          ΣXY
Shift 2 ► ► 3 =   -0.602310478     r

The association between the patient’s index of level of anxiety and the patient’s level
of satisfaction is negative and, at r = -0.60, sits at the boundary between moderate and
strong on the scale given earlier.
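
The three correlation coefficients can also be recomputed from the column totals alone. This Python sketch uses the sums reported in the three summary tables; the helper name `pearson_from_sums` is an assumption made for the example.

```python
import math


def pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson's r from the totals of a computation table."""
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return num / den


N = 23
SUM_Y, SUM_YY = 1411, 92707   # satisfaction totals, shared by all three tables

r_age = pearson_from_sums(N, 911, SUM_Y, 37661, SUM_YY, 53479)
r_severity = pearson_from_sums(N, 1168, SUM_Y, 59748, SUM_YY, 70695)
r_anxiety = pearson_from_sums(N, 52.8, SUM_Y, 123.24, SUM_YY, 3171.9)

print(round(r_age, 3), round(r_severity, 3), round(r_anxiety, 3))
```

Working from totals mirrors what the calculator does internally, so the results should agree with the r values it displays.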

3. Grade Point Average. The director of admissions of a small nursing school
administered a newly designed entrance test to 20 students selected at random
from the new freshman class in a study to determine whether a student’s grade
point average (GPA) at the end of the freshman year (Y) can be predicted from
the entrance test score (X). The results of the study follow.
a) Compute the slope and the intercept of the regression line.
b) Test the significance of the slope (use $\alpha = 0.05$).
c) Calculate r.
d) Test the null hypothesis that $\rho = 0$ against the alternative that $\rho > 0$ at the 0.01
level of significance.
e) What percentage of the variation in the GPA is explained by differences in the
entrance test scores?

Summary Table (Computations)


X Y X2 Y2 XY
5.50 3.10 30.25 9.61 17.05
4.80 2.30 23.04 5.29 11.04
4.70 3.00 22.09 9.00 14.1
3.90 1.90 15.21 3.61 7.41
4.50 2.50 20.25 6.25 11.25
6.20 3.70 38.44 13.69 22.94
6.00 3.40 36.00 11.56 20.4
5.20 2.60 27.04 6.76 13.52
4.70 2.80 22.09 7.84 13.16
7.30 1.60 53.29 2.56 11.68
4.90 2.00 24.01 4.00 9.8
5.40 2.90 29.16 8.41 15.66
5.00 2.30 25.00 5.29 11.5
6.30 3.20 39.69 10.24 20.16
4.60 1.80 21.16 3.24 8.28
4.30 1.40 18.49 1.96 6.02
5.00 2.00 25.00 4.00 10.0
5.90 3.80 34.81 14.44 22.42
4.10 2.20 16.81 4.84 9.02
4.70 1.50 22.09 2.25 7.05
103 50 543.92 134.84 262.46
Summary:
n = 20
$\sum X = 103$, $\sum Y = 50.0$, $\sum X^2 = 543.92$, $\sum Y^2 = 134.84$, $\sum XY = 262.46$

$\bar{X} = \frac{103}{20} = 5.15$, $\bar{Y} = \frac{50}{20} = 2.50$

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{20(262.46) - (103)(50.0)}{20(543.92) - (103)^2} = \frac{99.2}{269.4} = 0.368$$

$$b_0 = \bar{Y} - b_1\bar{X} = 2.50 - 0.368(5.15) = 0.605$$

Regression Equation: $\hat{Y} = b_0 + b_1 X = 0.605 + 0.368X$

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
5.50 , 3.10 M+
4.80 , 2.30 M+
4.70 , 3.00 M+
3.90 , 1.90 M+
4.50 , 2.50 M+
6.20 , 3.70 M+
6.00 , 3.40 M+
5.20 , 2.60 M+
4.70 , 2.80 M+
7.30 , 1.60 M+
4.90 , 2.00 M+
5.40 , 2.90 M+
5.00 , 2.30 M+
6.30 , 3.20 M+
4.60 , 1.80 M+
4.30 , 1.40 M+
5.00 , 2.00 M+
5.90 , 3.80 M+
4.10 , 2.20 M+
4.70 , 1.50 M+

To obtain the different statistics PRESS

Shift 1 1 =       543.92           ΣX²
Shift 1 2 =       103.             ΣX
Shift 1 3 =       20.              n
Shift 2 1 =       5.15             X̄
Shift 1 ► 1 =     134.84           ΣY²
Shift 1 ► 2 =     50.0             ΣY
Shift 1 ► 3 =     262.46           ΣXY
Shift 2 ► ► 1 =   0.603637713      A
Shift 2 ► ► 2 =   0.368225686      B
Shift 2 ► ► 3 =   0.430824437      r

a) $H_0: \beta_1 = 0$ (The slope of the regression line is equal to zero)
   $H_a: \beta_1 \ne 0$ (The slope of the regression line is not equal to zero)
b) Critical region and level of significance: Let $\alpha = 0.05$ be the level of significance;
thus $t_{\alpha/2,\,(n-2)} = t_{0.025,\,18} = 2.101$.

c) Computation:

$$S^2_{y \cdot x} = \frac{\sum Y^2 - \frac{(\sum Y)^2}{n} - b_1\left[\sum XY - \frac{(\sum X)(\sum Y)}{n}\right]}{n-2} = \frac{134.84 - \frac{(50.0)^2}{20} - 0.368\left[262.46 - \frac{(103)(50.0)}{20}\right]}{20-2} = \frac{8.015}{18} = 0.445$$

$$SSX = \sum X^2 - \frac{(\sum X)^2}{n} = 543.92 - \frac{(103)^2}{20} = 13.47$$

$$SE(b_1) = \sqrt{\frac{S^2_{y \cdot x}}{SSX}} = \sqrt{\frac{0.445}{13.47}} = \sqrt{0.0331} = 0.182$$

$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.368 - 0}{0.1818} = 2.024$$

d) Decision: Since the computed t-value (2.024) is less than the tabulated t-value
(2.101), there is not sufficient evidence to reject the null hypothesis. The
computed value lies between -2.101 and +2.101.

e) Conclusion: Therefore, the slope of the regression line is not significantly different
from zero at the 0.05 level. That is, the regression equation cannot reliably be used to
estimate the grade point average if the entrance test score is given.

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{20(262.46) - (103)(50)}{\sqrt{\left[20(543.92) - (103)^2\right]\left[20(134.84) - (50.0)^2\right]}} = \frac{99.2}{\sqrt{53,017.92}} = 0.431$$

a) Ho: There is no linear relationship between entrance exam score and grade
point average ($\rho = 0$).
Ha: There is a linear relationship between entrance exam score and grade
point average ($\rho \ne 0$).

b) Critical region and level of significance: Let $\alpha = 0.05$ be the level of significance;
then the critical region is $|t| > t_{\alpha/2,\,(n-2)} = t_{0.025,\,18} = 2.101$.

c) Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.431\sqrt{20-2}}{\sqrt{1-(0.431)^2}} = 2.026$$

d) Decision: Since the computed t-value (2.026) is less than the tabulated t-value
(2.101), there is not enough evidence to reject the null hypothesis. The same
decision holds for the one-sided alternative $\rho > 0$ at the 0.01 level stated in
the problem, where the tabulated value is $t_{0.01,\,18} = 2.552$.

e) Conclusion: Therefore, there is not enough evidence to conclude that
entrance exam score is linearly correlated with grade point
average. No significant linear relationship exists
between entrance exam score and grade point average.

Coefficient of Determination: $r^2 = (0.431)^2 = 0.186$

About 18.6% of the variation in grade point average is explained by the
variation in entrance exam scores. The other 81.4% of the variation in grade
point average is due to other factors not included in the study.
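
The correlation, the coefficient of determination, and the t-value for the GPA problem can be recomputed from the summary totals. This is a sketch only; the two critical values are standard t-table entries for 18 degrees of freedom and are hard-coded rather than computed.

```python
import math

# Totals from the GPA summary table.
n = 20
sum_x, sum_y = 103.0, 50.0
sum_xx, sum_yy, sum_xy = 543.92, 134.84, 262.46

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)
r_squared = r * r
t = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

T_TWO_SIDED_05 = 2.101   # t(0.025, 18), two-sided test at alpha = 0.05
T_ONE_SIDED_01 = 2.552   # t(0.01, 18), one-sided test at alpha = 0.01

reject_two_sided = abs(t) > T_TWO_SIDED_05
reject_one_sided = t > T_ONE_SIDED_01
print(round(r, 3), round(r_squared, 3), round(t, 3), reject_two_sided, reject_one_sided)
```

Using the unrounded r gives a t-value that differs slightly in the third decimal from the hand computation with r = 0.431; the decision is unaffected.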

4. Muscle Mass. A person’s muscle mass is expected to decrease with age. To
explore this relationship in women, a nutritionist randomly selected four women from
each 10-year age group, beginning with age 40 and ending with age 79. The results
follow; X is age, and Y is a measure of muscle mass.

i. Formulate the regression equation.
ii. Test the significance of the coefficients.
iii. Is there a significant relationship between age and the measure of
muscle mass at the 0.05 level of significance?
iv. What is the expected muscle mass of a woman who is
80 years old?
Summary Table (Computations)
X Y X2 Y2 XY
71 82 5041 6724 5822
64 91 4096 8281 5824
43 100 1849 10000 4300
67 68 4489 4624 4556
56 87 3136 7569 4872
73 73 5329 5329 5329
68 78 4624 6084 5304
56 80 3136 6400 4480

76 65 5776 4225 4940
65 84 4225 7056 5460
45 116 2025 13456 5220
58 76 3364 5776 4408
45 97 2025 9409 4365
53 100 2809 10000 5300
49 105 2401 11025 5145
78 77 6084 5929 6006
967 1379 60409 121887 81331

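Summary:
n = 16
$\sum X = 967$, $\sum Y = 1,379$, $\sum X^2 = 60,409$, $\sum Y^2 = 121,887$, $\sum XY = 81,331$

$\bar{X} = \frac{967}{16} = 60.4375$, $\bar{Y} = \frac{1,379}{16} = 86.1875$

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{16(81,331) - (967)(1,379)}{16(60,409) - (967)^2} = \frac{-32,197}{31,455} = -1.024$$

Using the unrounded slope $b_1 = -1.023589$:

$$b_0 = \bar{Y} - b_1\bar{X} = 86.1875 - (-1.023589)(60.4375) = 148.0507$$

Regression Equation: $\hat{Y} = b_0 + b_1 X = 148.0507 - 1.024X$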

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Data is inputted, one by one, by keying in the datum X first, followed by a comma
and the Y value, and pressing the DT (M+) button

PRESS
71 , 82 M+
64 , 91 M+

43 , 100 M+
67 , 68 M+
56 , 87 M+
73 , 73 M+
68 , 78 M+
56 , 80 M+
76 , 65 M+
65 , 84 M+
45 , 116 M+
58 , 76 M+
45 , 97 M+
53 , 100 M+
49 , 105 M+
78 , 77 M+

To obtain the different statistics PRESS

Shift 1 1 =       60,409.          ΣX²
Shift 1 2 =       967.             ΣX
Shift 1 3 =       16.              n
Shift 2 1 =       60.4375          X̄
Shift 1 ► 1 =     121,887.         ΣY²
Shift 1 ► 2 =     1,379.           ΣY
Shift 1 ► 3 =     81,331.          ΣXY
Shift 2 ► ► 1 =   148.0506756      A
Shift 2 ► ► 2 =   -1.023589254     B
Shift 2 ► ► 3 =   -0.823894252     r

a) $H_0: \beta_1 = 0$ (The slope of the regression line is equal to zero)
   $H_a: \beta_1 \ne 0$ (The slope of the regression line is not equal to zero)
b) Critical region and level of significance: Let $\alpha = 0.05$ be the level of significance;
thus $t_{\alpha/2,\,(n-2)} = t_{0.025,\,14} = 2.145$.

c) Computation:

$$S^2_{y \cdot x} = \frac{\sum Y^2 - \frac{(\sum Y)^2}{n} - b_1\left[\sum XY - \frac{(\sum X)(\sum Y)}{n}\right]}{n-2} = \frac{121,887 - \frac{(1,379)^2}{16} - (-1.024)\left[81,331 - \frac{(967)(1,379)}{16}\right]}{16-2} = \frac{973.8295}{14} = 69.5593$$

$$SSX = \sum X^2 - \frac{(\sum X)^2}{n} = 60,409 - \frac{(967)^2}{16} = 1,965.9375$$

$$SE(b_1) = \sqrt{\frac{S^2_{y \cdot x}}{SSX}} = \sqrt{\frac{69.5593}{1,965.9375}} = \sqrt{0.0354} = 0.1881$$

$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{-1.024 - 0}{0.1881} = -5.4439$$

d) Decision: Since the absolute value of the computed t-value (5.4439) is greater than
the tabulated t-value (2.145), there is enough evidence to reject the null
hypothesis. The computed t-value, -5.4439, lies in the critical region
below -2.145.

e) Conclusion: Therefore, the slope of the regression line is different from zero.

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{16(81,331) - (967)(1,379)}{\sqrt{\left[16(60,409) - (967)^2\right]\left[16(121,887) - (1,379)^2\right]}} = \frac{-32,197}{\sqrt{1,527,171,705}} = -0.8239$$

a) Ho: There is no linear relationship between age and measure of muscle
mass ($\rho = 0$).
Ha: There is a linear relationship between age and measure of muscle
mass ($\rho \ne 0$).

b) Critical region and level of significance: Let $\alpha = 0.05$ be the level of significance;
then the critical region is $|t| > t_{\alpha/2,\,(n-2)} = t_{0.025,\,14} = 2.145$.

c) Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{(-0.8239)\sqrt{16-2}}{\sqrt{1-(-0.8239)^2}} = -5.4395$$

d) Decision: Since the absolute value of the computed t-value (5.4395) is greater than
the tabulated t-value (2.145), there is evidence to reject the null hypothesis.

e) Conclusion: Therefore, there is reason to believe that the measure of
muscle mass decreases as we grow older. The t-test revealed
that a significant negative linear relationship exists
between age and measure of muscle mass.
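
Item iv of the problem asks for the expected muscle mass at age 80, which follows by substituting X = 80 into the fitted equation. The Python sketch below rebuilds the line from the summary totals; the helper name `predict_muscle_mass` is hypothetical.

```python
# Totals from the muscle mass summary table.
n = 16
sum_x, sum_y = 967, 1379
sum_xx, sum_xy = 60409, 81331

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n


def predict_muscle_mass(age):
    """Fitted value from the least squares line for a given age."""
    return b0 + b1 * age


pred_80 = predict_muscle_mass(80)
print(round(b1, 4), round(b0, 4), round(pred_80, 2))
```

Age 80 lies just beyond the sampled ages (the largest observed X is 78), so the estimate is a mild extrapolation and should be read with that in mind.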

5. An urban sociologist interested in neighborliness collected data for a sample of
10 adults on (X) how many years they have lived in their neighborhood and (Y)
how many of their neighbors they regard as friends. Compute Pearson’s
correlation coefficient for these data and determine whether the correlation is
significant.

Summary Table (Computations)


X Y X2 Y2 XY
1 1 1 1 1
5 4 25 16 20
6 2 36 4 12
1 3 1 9 3
8 5 64 25 40
2 1 4 1 2
5 2 25 4 10
9 6 81 36 54
4 7 16 49 28
2 0 4 0 0
43 31 257 145 170

28
Summary:
$n = 10$, $\sum X = 43$, $\sum Y = 31$, $\sum X^2 = 257$, $\sum Y^2 = 145$, $\sum XY = 170$

$$\bar{X} = \frac{43}{10} = 4.3 \qquad \bar{Y} = \frac{31}{10} = 3.1$$

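$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$

$$= \frac{10(170) - (43)(31)}{\sqrt{\left[10(257) - (43)^2\right]\left[10(145) - (31)^2\right]}} = \frac{367}{\sqrt{352{,}569}} = 0.618$$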
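
As a cross-check on the summary table and the value of r, the same steps can be sketched in Python from the raw data. The variable names (`years`, `friends`, and so on) are illustrative assumptions; the data values are those listed in the table.

```python
import math

# Raw data from the summary table.
years = [1, 5, 6, 1, 8, 2, 5, 9, 4, 2]    # X: years lived in neighborhood
friends = [1, 4, 2, 3, 5, 1, 2, 6, 7, 0]  # Y: neighbors regarded as friends

n = len(years)

# Column totals of the summary table.
sum_x = sum(years)
sum_y = sum(friends)
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in friends)
sum_xy = sum(x * y for x, y in zip(years, friends))

# Computational formula for Pearson's r.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(r, 3))
```

The totals should agree with the last row of the table, and r should agree with the hand-computed value to three decimals.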

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Enter the data one pair at a time: key in the X value, then a comma, then
the Y value, and press the DT (M+) button.

PRESS
1 , 1 M+
5 , 4 M+
6 , 2 M+
1 , 3 M+
8 , 5 M+
2 , 1 M+
5 , 2 M+
9 , 6 M+
4 , 7 M+
2 , 0 M+

To obtain the different statistics, PRESS

Shift 1 1 = 257. ($\sum X^2$)
Shift 1 2 = 43. ($\sum X$)
Shift 1 3 = 10. ($n$)
Shift 2 1 = 4.3 ($\bar{X}$)
Shift 1 ▶ 1 = 145. ($\sum Y^2$)
Shift 1 ▶ 2 = 31. ($\sum Y$)
Shift 1 ▶ 3 = 170. ($\sum XY$)
Shift 2 ▶ ▶ 3 = 0.61807902 ($r$)

a) Ho: There is no linear relationship between (X) how many years they have
lived in their neighborhood and (Y) how many of their neighbors they
regard as friends. ($\rho = 0$)
Ha: There is a linear relationship between (X) how many years they have lived
in their neighborhood and (Y) how many of their neighbors they regard
as friends. ($\rho \neq 0$)

b) Critical region and Level of Significance: Let $\alpha = 0.10$ be the level of significance; then
the critical region is $|t| > t_{\alpha/2,\, n-2} = t_{0.05,\, 8} = 1.86$.

c) Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.618\sqrt{10-2}}{\sqrt{1-(0.618)^2}} = 2.2234$$

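d) Decision: Since the computed t-value (2.2234) is greater than the tabulated
t-value (1.86), there is enough evidence to reject the null hypothesis.

[Figure: t distribution with critical values -1.86 and +1.86; the computed t = 2.2234 falls in the upper critical region.]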

e) Conclusion: Therefore, there is reason to believe that there is a linear
relationship between (X) how many years they have lived in
their neighborhood and (Y) how many of their neighbors they
regard as friends. The t-test revealed that a significant
positive linear relationship exists between the two variables.
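
The decision rule used in steps b) to d) can be written as a small helper. This is a sketch under stated assumptions: the function name `correlation_t_test` is hypothetical, and the critical value is passed in from a t table rather than computed.

```python
import math


def correlation_t_test(r, n, t_critical):
    """Two-tailed t test of Ho: rho = 0.

    Returns the t statistic and True when Ho is rejected, that is,
    when |t| exceeds the tabulated critical value for n - 2 df.
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return t, abs(t) > t_critical


# Neighborliness data: r = 0.618, n = 10, alpha = 0.10 (t table: 1.86).
t, reject = correlation_t_test(r=0.618, n=10, t_critical=1.86)
print(round(t, 4), reject)

# Same statistic at alpha = 0.05 (t table value for 8 df: 2.306).
_, reject_at_05 = correlation_t_test(r=0.618, n=10, t_critical=2.306)
print(reject_at_05)
```

The second call shows that the decision depends on the chosen level of significance: the same t statistic does not reach the critical value at the 0.05 level.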

6. A criminologist studying the relationship between population density and robbery rate
in medium-sized US cities collected the following data for a random sample of 16
cities; X is the population density of the city (number of people per unit area), and Y
is the robbery rate last year (number of robberies per 100,000 people). Assume that
the first-order regression model is appropriate.

a) Obtain the estimated regression function. Plot the estimated regression function
and the data. Does the linear regression function appear to give a good fit here?
Discuss.
b) Obtain a point estimate of the mean robbery rate last year in cities with population
density X = 60.

Summary Table (Computations)


            X       Y       X²        Y²        XY
           59     209     3481     43681     12331
           49     180     2401     32400      8820
           75     195     5625     38025     14625
           54     192     2916     36864     10368
           78     215     6084     46225     16770
           56     197     3136     38809     11032
           60     208     3600     43264     12480
           82     189     6724     35721     15498
           69     213     4761     45369     14697
           83     201     6889     40401     16683
           88     214     7744     45796     18832
           94     212     8836     44944     19928
           47     205     2209     42025      9635
           65     186     4225     34596     12090
           89     200     7921     40000     17800
           70     204     4900     41616     14280
Total    1118    3220    81452    649736    225869

Summary:
$n = 16$, $\sum X = 1{,}118$, $\sum Y = 3{,}220$, $\sum X^2 = 81{,}452$, $\sum Y^2 = 649{,}736$, $\sum XY = 225{,}869$

$$\bar{X} = \frac{1{,}118}{16} = 69.875 \qquad \bar{Y} = \frac{3{,}220}{16} = 201.25$$

$$b_1 = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} = \frac{16(225{,}869) - (1{,}118)(3{,}220)}{16(81{,}452) - (1{,}118)^2} = \frac{13{,}944}{53{,}308} = 0.2616$$

$$b_0 = \bar{Y} - b_1\bar{X} = 201.25 - 0.2616(69.875) = 182.9707$$

Regression equation: $\hat{Y} = b_0 + b_1 X = 182.9707 + 0.2616X$
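
The least-squares estimates can also be reproduced from the raw data. The sketch below assumes a hypothetical helper `least_squares`; it keeps full precision for the slope, so its intercept matches the calculator value (about 182.9725) rather than the hand-computed 182.9707, which uses the slope rounded to 0.2616.

```python
# Raw data from the summary table.
density = [59, 49, 75, 54, 78, 56, 60, 82, 69, 83, 88, 94, 47, 65, 89, 70]
robbery = [209, 180, 195, 192, 215, 197, 208, 189,
           213, 201, 214, 212, 205, 186, 200, 204]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(a * a for a in x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(density, robbery)
print(round(b0, 4), round(b1, 4))

# Point estimate of the mean robbery rate at density X = 60.
y_hat_60 = b0 + b1 * 60
print(round(y_hat_60, 2))
```

The estimate at X = 60 is about 198.67 either way, since the rounding in the slope and in the intercept nearly cancel near the mean of X.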

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Enter the data one pair at a time: key in the X value, then a comma, then
the Y value, and press the DT (M+) button.

PRESS
59 , 209 M+
49 , 180 M+
75 , 195 M+
54 , 192 M+
78 , 215 M+
56 , 197 M+
60 , 208 M+
82 , 189 M+
69 , 213 M+
83 , 201 M+
88 , 214 M+
94 , 212 M+
47 , 205 M+
65 , 186 M+
89 , 200 M+
70 , 204 M+

To obtain the different statistics, PRESS

Shift 1 1 = 81,452. ($\sum X^2$)
Shift 1 2 = 1,118. ($\sum X$)
Shift 1 3 = 16. ($n$)
Shift 2 1 = 69.875 ($\bar{X}$)
Shift 1 ▶ 1 = 649,736. ($\sum Y^2$)
Shift 1 ▶ 2 = 3,220. ($\sum Y$)
Shift 1 ▶ 3 = 225,869. ($\sum XY$)
Shift 2 ▶ ▶ 1 = 182.9724994 (A, the intercept $b_0$)
Shift 2 ▶ ▶ 2 = 0.261574247 (B, the slope $b_1$)
Shift 2 ▶ ▶ 3 = 0.365011194 ($r$)

[Figure: Scatter plot of robbery rate per 100,000 (vertical axis, 175 to 220) against population density (horizontal axis, 0 to 100).]

a) Ho: $\beta_1 = 0$ (The slope of the regression line is equal to zero)
Ha: $\beta_1 \neq 0$ (The slope of the regression line is not equal to zero)
b) Critical region and Level of Significance: Let $\alpha = 0.05$ be the level of significance;
thus the critical region is $|t| > t_{\alpha/2,\, n-2} = t_{0.025,\, 14} = 2.145$.
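c) Computation:

$$S^2_{y \cdot x} = \frac{\sum Y^2 - \frac{(\sum Y)^2}{n} - b_1\left(\sum XY - \frac{\sum X \sum Y}{n}\right)}{n-2}$$

$$= \frac{649{,}736 - \frac{(3{,}220)^2}{16} - 0.2616\left(225{,}869 - \frac{(1{,}118)(3{,}220)}{16}\right)}{16-2} = \frac{1{,}483.0156}{14} = 105.9296$$

$$SSX = \sum X^2 - \frac{(\sum X)^2}{n} = 81{,}452 - \frac{(1{,}118)^2}{16} = 3{,}331.75$$

$$SE(b_1) = \sqrt{\frac{S^2_{y \cdot x}}{SSX}} = \sqrt{\frac{105.9296}{3{,}331.75}} = \sqrt{0.03179} = 0.1783$$

$$t = \frac{b_1 - \beta_1}{SE(b_1)} = \frac{0.2616}{0.1783} = 1.4672$$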
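d) Decision: Since the computed t-value (1.4672) is less than the tabulated t-value
(2.145), there is not enough evidence to reject the null hypothesis.
e) Conclusion: Therefore, the slope of the regression line is not significantly
different from zero. Thus, the regression equation is not a reliable
tool for estimating the robbery rate when the population density is given.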
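
The slope test can be retraced from the summary totals alone. In this sketch the names `ssy` and `sxy` (the corrected sum of squares of Y and the corrected sum of cross products) are introduced for convenience and do not appear in the lesson; `ssx` corresponds to SSX. Full precision is kept throughout, so the result differs from the hand computation in the fourth decimal.

```python
import math

# Summary totals for the population density / robbery rate data.
n = 16
sum_x, sum_y = 1118, 3220
sum_x2, sum_y2, sum_xy = 81452, 649736, 225869

ssx = sum_x2 - sum_x ** 2 / n      # SSX = 3331.75
ssy = sum_y2 - sum_y ** 2 / n      # 1711.0
sxy = sum_xy - sum_x * sum_y / n   # 871.5

b1 = sxy / ssx                      # slope estimate
s2_yx = (ssy - b1 * sxy) / (n - 2)  # error variance estimate
se_b1 = math.sqrt(s2_yx / ssx)      # standard error of the slope
t = (b1 - 0) / se_b1                # test statistic under Ho: beta1 = 0

t_critical = 2.145                  # t table, alpha/2 = 0.025, 14 df
reject = abs(t) > t_critical

print(round(s2_yx, 2), round(se_b1, 4), round(t, 3), reject)
```

The statistic stays below 2.145, so the script reaches the same decision: the null hypothesis is not rejected.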

If $X = 60$, then $\hat{Y} = 182.9707 + 0.2616(60) = 198.67 \approx 199.0$

The expected number of robberies per 100,000 people is 199 if the number of
persons per unit area is 60. However, the test revealed that the slope of the
regression equation is not significantly different from zero. Thus, the regression
equation is not a good tool for estimating the robbery rate when the population
density is given.

7. An economist is interested in studying the relationship between length of
unemployment and job-seeking activity among white-collar workers. He interviews a
sample of 12 unemployed accountants as to the number of weeks they have been
unemployed (X) and seeking a job during the past year (Y). Compute Pearson’s
correlation coefficient for these data and determine whether the correlation is
significant.

Summary Table (Computations)


          X     Y    X²    Y²    XY
          2     8     4    64    16
          7     3    49     9    21
          5     4    25    16    20
         12     2   144     4    24
          1     5     1    25     5
         10     2   100     4    20
          8     1    64     1     8
          6     5    36    25    30
          5     4    25    16    20
          2     6     4    36    12
          3     7     9    49    21
          4     1    16     1     4
Total    65    48   477   250   201

Summary:
$n = 12$, $\sum X = 65$, $\sum Y = 48$, $\sum X^2 = 477$, $\sum Y^2 = 250$, $\sum XY = 201$

$$\bar{X} = \frac{65}{12} = 5.4167 \qquad \bar{Y} = \frac{48}{12} = 4.0$$

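$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{12(201) - (65)(48)}{\sqrt{\left[12(477) - (65)^2\right]\left[12(250) - (48)^2\right]}}$$

$$= \frac{-708}{\sqrt{1{,}043{,}304}} = -0.6932$$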

The length of unemployment is inversely correlated with job-seeking activity
among white-collar workers. The two variables are moderately correlated.
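
The same coefficient can be obtained from deviations about the means, which ties r to the SSX quantity used in the slope test. This is a sketch of that equivalent form, not the method used in the lesson; the names `ssy` and `sxy` are introduced here for illustration.

```python
import math

# Raw data from the summary table.
weeks_unemployed = [2, 7, 5, 12, 1, 10, 8, 6, 5, 2, 3, 4]  # X
job_seeking = [8, 3, 4, 2, 5, 2, 1, 5, 4, 6, 7, 1]         # Y

n = len(weeks_unemployed)
mean_x = sum(weeks_unemployed) / n
mean_y = sum(job_seeking) / n

# Corrected sums of squares and cross products (deviation form).
ssx = sum((x - mean_x) ** 2 for x in weeks_unemployed)
ssy = sum((y - mean_y) ** 2 for y in job_seeking)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(weeks_unemployed, job_seeking))

# r = SXY / sqrt(SSX * SSY) is algebraically the same as the
# computational formula built from the column totals.
r = sxy / math.sqrt(ssx * ssy)
print(round(r, 4))
```

Dividing the numerator and denominator of the computational formula by n gives this form, so both routes should agree up to floating-point rounding.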

Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Enter the data one pair at a time: key in the X value, then a comma, then
the Y value, and press the DT (M+) button.

PRESS
2 , 8 M+
7 , 3 M+
5 , 4 M+
12 , 2 M+
1 , 5 M+
10 , 2 M+
8 , 1 M+

6 , 5 M+
5 , 4 M+
2 , 6 M+
3 , 7 M+
4 , 1 M+

To obtain the different statistics, PRESS

Shift 1 1 = 477. ($\sum X^2$)
Shift 1 2 = 65. ($\sum X$)
Shift 1 3 = 12. ($n$)
Shift 2 1 = 5.416666667 ($\bar{X}$)
Shift 1 ▶ 1 = 250. ($\sum Y^2$)
Shift 1 ▶ 2 = 48. ($\sum Y$)
Shift 1 ▶ 3 = 201. ($\sum XY$)
Shift 2 ▶ ▶ 3 = -0.693150947 ($r$)

a) Ho: There is no linear relationship between length of unemployment and job-seeking
activity among white-collar workers. ($\rho = 0$)
Ha: There is a linear relationship between length of unemployment and job-seeking
activity among white-collar workers. ($\rho \neq 0$)
b) Critical region and Level of Significance: Let $\alpha = 0.05$ be the level of significance; then the
critical region is $|t| > t_{\alpha/2,\, n-2} = t_{0.025,\, 10} = 2.228$.
c) Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.6932\sqrt{12-2}}{\sqrt{1-(-0.6932)^2}} = -3.041$$

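d) Decision: The absolute value of the computed t-value (3.041) is greater than the
tabulated t-value (2.228); thus there is enough evidence to reject the null
hypothesis. The computed t-value is located in the critical region as shown in
the figure below.

[Figure: t distribution with critical values -2.228 and +2.228; the computed t = -3.041 falls in the lower critical region.]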
e) Conclusion: Therefore, there is a linear relationship between length of
unemployment and job-seeking activity among white-collar
workers. The two variables are moderately and inversely
correlated.

8. In preparing for an examination, some students in a class studied more than others.
Each student’s grade on the 10-point exam and the number of hours studied are listed
as follows:

Student     Hours Studied    Exam Grade
Amy               4               5
Ajace             1               2
Dianne            3               1
Owell             5               5
Charles           8               9
Ryann             2               7
Sanford           7               6
Joemylou          6               8

Calculate Pearson’s correlation coefficient and determine whether the
correlation is significant.

Summary Table (Computations)


          X     Y    X²    Y²    XY
          4     5    16    25    20
          1     2     1     4     2
          3     1     9     1     3
          5     5    25    25    25
          8     9    64    81    72
          2     7     4    49    14
          7     6    49    36    42
          6     8    36    64    48
Total    36    43   204   285   226

Summary:
$n = 8$, $\sum X = 36$, $\sum Y = 43$, $\sum X^2 = 204$, $\sum Y^2 = 285$, $\sum XY = 226$

$$\bar{X} = \frac{36}{8} = 4.5 \qquad \bar{Y} = \frac{43}{8} = 5.375$$

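$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{8(226) - (36)(43)}{\sqrt{\left[8(204) - (36)^2\right]\left[8(285) - (43)^2\right]}}$$

$$= \frac{260}{\sqrt{144{,}816}} = 0.683$$
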
Use of Calculator (Casio S-V.P.A.M fx-991MS)

PRESS
MODE MODE 2 1
The calculator is ready to perform Linear Regression Function. Again, before we
start to input data PRESS
Shift Clr 1 = AC
to ensure that no other data is present in the calculator memory.
Data Input:

Enter the data one pair at a time: key in the X value, then a comma, then
the Y value, and press the DT (M+) button.

PRESS
4 , 5 M+
1 , 2 M+
3 , 1 M+
5 , 5 M+
8 , 9 M+
2 , 7 M+
7 , 6 M+
6 , 8 M+

To obtain the different statistics, PRESS

Shift 1 1 = 204. ($\sum X^2$)
Shift 1 2 = 36. ($\sum X$)
Shift 1 3 = 8. ($n$)
Shift 2 1 = 4.5 ($\bar{X}$)
Shift 1 ▶ 1 = 285. ($\sum Y^2$)
Shift 1 ▶ 2 = 43. ($\sum Y$)
Shift 1 ▶ 3 = 226. ($\sum XY$)
Shift 2 ▶ ▶ 3 = 0.683227084 ($r$)

a) Ho: There is no linear relationship between exam grade and number of hours
studied. ($\rho = 0$)
Ha: There is a linear relationship between exam grade and number of hours
studied. ($\rho \neq 0$)

b) Critical region and Level of Significance: Let $\alpha = 0.05$ be the level of significance; then
the critical region is $|t| > t_{\alpha/2,\, n-2} = t_{0.025,\, 6} = 2.447$.
c) Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.683\sqrt{8-2}}{\sqrt{1-(0.683)^2}} = 2.292$$

d) Decision: Since the computed t-value (2.292) is less than the tabulated t-value
(2.447), there is not enough evidence to reject the null
hypothesis.
e) Conclusion: Therefore, there is not enough evidence to conclude that exam
grade is linearly correlated with the number of hours studied.
No significant linear relationship exists between exam
grade and number of hours studied.
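
Because the computed t-value here is fairly close to the critical value, it is worth checking how much rounding r affects it. The sketch below (with an illustrative helper `t_for_r`) computes t once with the unrounded r and once with r rounded to three decimals; the quoted 2.292 corresponds to the unrounded r shown on the calculator.

```python
import math

# Raw data from the table of students.
hours = [4, 1, 3, 5, 8, 2, 7, 6]   # X: hours studied
grades = [5, 2, 1, 5, 9, 7, 6, 8]  # Y: exam grade

n = len(hours)
sum_x, sum_y = sum(hours), sum(grades)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)


def t_for_r(r, n):
    """t statistic for Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_full = t_for_r(r, n)               # about 2.292
t_rounded = t_for_r(round(r, 3), n)  # about 2.290

t_critical = 2.447                   # t table, alpha/2 = 0.025, 6 df
print(round(t_full, 3), round(t_rounded, 3))
print(abs(t_full) > t_critical, abs(t_rounded) > t_critical)
```

Both versions stay below 2.447, so the decision not to reject the null hypothesis does not depend on the rounding.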
