
INTERNATIONAL UNIVERSITY (IU) Engineering Probability & Statistic

ISE Department Lecturer: Phan Nguyễn Kỳ Phúc


--------------------o0o------------------

Regression Analysis
Problem: we are given a data set with one output and multiple inputs.

Y    72  76  78  70  68  80  82  65  62  90
X1   12  11  15  10  11  16  14   8   8  18
X2    5   8   6   5   3   9  12   4   3  10

Assume that the relationship between the output and the inputs is linear, so that it can be expressed as

$$
Y = \beta_0 + \sum_{i=1}^{k} \beta_i X_i + \varepsilon
$$

where $\varepsilon \sim N(0, \sigma^2)$ can be interpreted as noise. In this model the coefficients $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k)$ are unknown.
Our objective is therefore to construct a prediction model of Y that gives the minimum error
on the given data set.

Assume that our forecasting model is

$$
\hat{Y} = \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i
$$

So the squared error between the forecasting model and the output of a record is

$$
\text{error}^2 = (\hat{Y} - Y)^2 = \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big)^2
$$

For example, for the 1st record $(Y = 72, X_1 = 12, X_2 = 5)$ the squared error is

$$
\text{error}^2 = (\hat{Y} - Y)^2 = \big( \hat{\beta}_0 + 12\hat{\beta}_1 + 5\hat{\beta}_2 - 72 \big)^2
$$

Since we consider the whole data set, the squared error must be summed over all records:

$$
L = \sum_{n} \text{error}^2 = \sum_{n} (\hat{Y} - Y)^2 = \sum_{n} \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big)^2
$$

where n is the number of records in the data set.

For the above data set, n=10.
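
As a minimal sketch of the quantities just defined, the following Python code evaluates the squared error of a single record and the summed loss L for the example data set. The function names and the trial coefficients are illustrative assumptions, not part of the original notes.

```python
# Example data set from the table above.
Y = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14, 8, 8, 18]
X2 = [5, 8, 6, 5, 3, 9, 12, 4, 3, 10]


def squared_error(beta, x, y):
    """Squared error of one record: (beta0 + sum_i beta_i * x_i - y)^2."""
    prediction = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return (prediction - y) ** 2


def loss(beta):
    """Sum of the squared errors L over all n records."""
    return sum(squared_error(beta, (x1, x2), y) for x1, x2, y in zip(X1, X2, Y))


# Hypothetical trial coefficients, chosen only to illustrate the computation.
trial = (50.0, 1.0, 1.0)
print(squared_error(trial, (12, 5), 72))  # squared error of the 1st record
print(loss(trial))                        # loss over all 10 records
```

Any other choice of coefficients can be passed to `loss` in the same way; the regression coefficients are the choice that makes this value as small as possible.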

To minimize the sum of squared errors, we take the first derivative of L with respect to each coefficient $\hat{\beta}_0$ and $\hat{\beta}_j$ (j = 1, ..., k):

$$
\frac{\partial L}{\partial \hat{\beta}_0} = \sum_{n} 2 \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big),
\qquad
\frac{\partial L}{\partial \hat{\beta}_j} = \sum_{n} 2 X_j \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big)
$$

Setting each derivative equal to zero gives

$$
\frac{\partial L}{\partial \hat{\beta}_0} = 0 \;\Rightarrow\; \sum_{n} \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big) = 0,
\qquad
\frac{\partial L}{\partial \hat{\beta}_j} = 0 \;\Rightarrow\; \sum_{n} X_j \Big( \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i X_i - Y \Big) = 0
$$

Solving this system of linear equations gives the values of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$.
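
As a worked check, the zero-derivative conditions can be written out for the example data set (k = 2, n = 10). The numeric sums below were computed from the data table and are not stated in the original notes: $\sum X_1 = 123$, $\sum X_2 = 65$, $\sum Y = 743$, $\sum X_1^2 = 1615$, $\sum X_2^2 = 509$, $\sum X_1 X_2 = 869$, $\sum X_1 Y = 9382$, $\sum X_2 Y = 5040$.

$$
\begin{aligned}
10\,\hat{\beta}_0 + 123\,\hat{\beta}_1 + 65\,\hat{\beta}_2 &= 743 \\
123\,\hat{\beta}_0 + 1615\,\hat{\beta}_1 + 869\,\hat{\beta}_2 &= 9382 \\
65\,\hat{\beta}_0 + 869\,\hat{\beta}_1 + 509\,\hat{\beta}_2 &= 5040
\end{aligned}
$$

Solving these three equations gives the coefficient values reported for this example.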

For the above example,

$$
\hat{\beta}_0 = 47.1649, \quad \hat{\beta}_1 = 1.5990, \quad \hat{\beta}_2 = 1.1487
$$

The next two concerns are:

• Whether this regression model is valid (good enough to explain the data)
• Whether we can exclude some inputs, i.e., simplify the current model while still explaining the
data

To answer the first concern, we use the F-test.

ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares   Dof           Mean Square                 F Ratio
Regression            SSR              k             MSR = SSR / k               F = MSR / MSE
Error                 SSE              n - (k + 1)   MSE = SSE / (n - (k + 1))
Total                 SST              n - 1

n: number of data records

k: number of inputs

$$
R^2 = 1 - \frac{SSE}{SST}; \qquad \bar{R}^2 = 1 - \frac{MSE}{MST}
$$

Here $\bar{R}^2$ is the adjusted $R^2$ and $MST = SST / (n - 1)$.

In the above example k = 2, n = 10.

$$
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
$$

$$
SST = SSE + SSR
$$

(Total sum of squares) = (Sum of squares for error) + (Sum of squares for regression)
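
The decomposition and the ANOVA quantities can be checked numerically on the example data set. The sketch below is a pure-Python illustration, not the software used in these notes. The helper name `solve` is hypothetical, and the critical value 4.74 for F with (2, 7) degrees of freedom assumes a 5 percent significance level, which the text does not specify.

```python
Y = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14, 8, 8, 18]
X2 = [5, 8, 6, 5, 3, 9, 12, 4, 3, 10]
n, k = len(Y), 2


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]


# Least squares fit from the zero-derivative (normal) equations.
cols = [[1] * n, X1, X2]
A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
rhs = [sum(u * y for u, y in zip(ci, Y)) for ci in cols]
b0, b1, b2 = solve(A, rhs)

fitted = [b0 + b1 * x1 + b2 * x2 for x1, x2 in zip(X1, X2)]
y_bar = sum(Y) / n

sst = sum((y - y_bar) ** 2 for y in Y)
sse = sum((y - f) ** 2 for y, f in zip(Y, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)

msr = ssr / k
mse = sse / (n - (k + 1))
f_ratio = msr / mse
r2 = 1 - sse / sst
r2_adj = 1 - mse / (sst / (n - 1))

F_CRIT = 4.74  # F(0.05; 2, 7) from a standard F-table; alpha = 0.05 is an assumption
print(f"SST={sst:.2f} SSE={sse:.2f} SSR={ssr:.2f}")
print(f"F={f_ratio:.2f} R2={r2:.3f} R2(adj)={r2_adj:.3f}")
print("model is significant" if f_ratio > F_CRIT else "model is not significant")
```

An F ratio far above the table value indicates that the regression explains much more variation than would be expected from noise alone.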


Another approach

The response can be expressed in matrix form as

$$
\begin{pmatrix} Y^{1} \\ \vdots \\ Y^{n} \end{pmatrix}
=
\begin{pmatrix}
1 & X_1^{1} & X_2^{1} & \cdots & X_k^{1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_1^{n} & X_2^{n} & \cdots & X_k^{n}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
$$

where the superscript denotes the index of the record. In compact form, the response can be expressed as

$$
Y = X\beta
$$

To find $\hat{\beta}$, the following formula is applied:

$$
\hat{\beta} = (X'X)^{-1} X'Y,
$$

where X' is the transpose of the matrix X.
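
The matrix formula can be traced step by step in code. The sketch below uses only the Python standard library and mirrors the formula literally; the helper names `transpose`, `matmul` and `inverse` are assumptions made for this illustration. In practice a linear algebra library would normally be used instead of an explicit inverse.

```python
Y = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14, 8, 8, 18]
X2 = [5, 8, 6, 5, 3, 9, 12, 4, 3, 10]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    m = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(m)]
         for i, row in enumerate(A)]
    for c in range(m):
        p = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(m):
            if r != c:
                f = M[r][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [row[m:] for row in M]


# Design matrix X: a leading column of ones, then one column per input.
X = [[1, x1, x2] for x1, x2 in zip(X1, X2)]
Y_col = [[y] for y in Y]  # Y as an n-by-1 column vector

Xt = transpose(X)
beta_hat = matmul(inverse(matmul(Xt, X)), matmul(Xt, Y_col))
coefficients = [row[0] for row in beta_hat]
print([round(b, 4) for b in coefficients])
```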

The covariance matrix of $\hat{\beta}$ is given as follows:

$$
\mathrm{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}
$$

where $\sigma^2$ is estimated by MSE.

The square roots of the main diagonal elements of this matrix are the standard errors of the
model parameters.
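
The standard errors can be computed directly from this covariance matrix. The following is a minimal pure-Python sketch for the example data set; the helper name `inverse` and the variable names are assumptions made for this illustration.

```python
import math

Y = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14, 8, 8, 18]
X2 = [5, 8, 6, 5, 3, 9, 12, 4, 3, 10]
X = [[1.0, x1, x2] for x1, x2 in zip(X1, X2)]
n, p = len(X), len(X[0])  # p = k + 1 model parameters


def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    m = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(m)]
         for i, row in enumerate(A)]
    for c in range(m):
        q = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[q] = M[q], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(m):
            if r != c:
                f = M[r][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [row[m:] for row in M]


XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
XtX_inv = inverse(XtX)
beta_hat = [sum(XtX_inv[i][j] * XtY[j] for j in range(p)) for i in range(p)]

# MSE = SSE / (n - (k + 1)) is used as the estimate of sigma^2.
residuals = [y - sum(b * v for b, v in zip(beta_hat, row)) for row, y in zip(X, Y)]
mse = sum(r * r for r in residuals) / (n - p)

cov = [[mse * v for v in row] for row in XtX_inv]
std_errors = [math.sqrt(cov[i][i]) for i in range(p)]
print([round(s, 4) for s in std_errors])
```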

ANOVA table

Source       Dof   SS       MS       F       p
Regression   2     630.54   315.27   86.34   0.000
Error        7     25.56    3.65
Total        9     656.10
R-sq = 96.1%   R-sq(adj) = 95.0%


To answer the second concern, we use the t-test for each coefficient.

In the t-test for coefficient $\beta_i$, the hypotheses are $H_0: \beta_i = 0$ versus $H_1: \beta_i \neq 0$. Running the
software on the above data set, we obtain

Predictor   Coef     Stdev    t-ratio   p
Constant    47.165   2.470    19.09     0.000
X1          1.5990   0.2810   5.69      0.000
X2          1.1487   0.3052   3.76      0.007

To look up the critical value in the t-table, we use the degrees of freedom of SSE, n - (k + 1), which is 7 in this example.
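
The t-ratio of each coefficient is its estimate divided by its standard error, and the decision compares its absolute value with the table value. The sketch below takes the estimates and standard errors from the software output above. The critical value 2.365 is the standard two-sided t-table entry for 7 degrees of freedom at a 5 percent significance level; that level is an assumption, since the text does not state one.

```python
# (coefficient estimate, standard error) copied from the software output above.
estimates = {
    "Constant": (47.165, 2.470),
    "X1": (1.5990, 0.2810),
    "X2": (1.1487, 0.3052),
}

n, k = 10, 2
dof = n - (k + 1)  # degrees of freedom of SSE
T_CRIT = 2.365     # t(0.025; 7) from a standard t-table; alpha = 0.05 is an assumption

t_ratios = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_ratios.items():
    decision = "reject H0" if abs(t) > T_CRIT else "do not reject H0"
    print(f"{name}: t = {t:.2f}, {decision}")
```

Because the inputs here are the rounded table values, the recomputed t-ratios can differ from the printed ones in the last digit.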

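Finding the confidence interval for the output

Given an input x, we need to find the confidence interval for the output Y(x).

With 100(1-α) percent confidence, Y(x) will lie between

$$
\sum_{i=0}^{k} x_i \hat{\beta}_i \;\pm\; \sqrt{\frac{SS_R}{n-k-1}} \, \sqrt{1 + x'(X'X)^{-1}x} \;\, t_{\alpha/2,\, n-k-1}
$$

where $x = (x_0, x_1, \ldots, x_k)'$ with $x_0 = 1$. Since it is divided by the error degrees of freedom n - k - 1, $SS_R$ here is to be read as the sum of squared residuals (SSE in the ANOVA table), so that $SS_R / (n - k - 1) = MSE$.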
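
The interval can be evaluated numerically as follows. This is a sketch under stated assumptions: the query point x1 = 12, x2 = 5 is a hypothetical choice (the inputs of the 1st record), the level is taken as 95 percent with table value t(0.025; 7) = 2.365, the residual sum of squares is used for the variance estimate, and `solve` is a hypothetical helper name.

```python
import math

Y = [72, 76, 78, 70, 68, 80, 82, 65, 62, 90]
X1 = [12, 11, 15, 10, 11, 16, 14, 8, 8, 18]
X2 = [5, 8, 6, 5, 3, 9, 12, 4, 3, 10]
X = [[1.0, x1, x2] for x1, x2 in zip(X1, X2)]
n, p = len(X), len(X[0])  # p = k + 1


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(m):
        q = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[q] = M[q], M[c]
        for r in range(m):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][m] / M[i][i] for i in range(m)]


XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
beta_hat = solve(XtX, XtY)

residuals = [y - sum(b * v for b, v in zip(beta_hat, row)) for row, y in zip(X, Y)]
mse = sum(r * r for r in residuals) / (n - p)  # residual sum of squares / (n - k - 1)

x = [1.0, 12.0, 5.0]  # hypothetical query point, with x0 = 1 for the intercept
T_CRIT = 2.365        # t(0.025; 7); the 95 percent level is an assumption

center = sum(b * v for b, v in zip(beta_hat, x))
# x'(X'X)^{-1} x, obtained by solving (X'X) z = x and taking the dot product x'z.
leverage = sum(v * z for v, z in zip(x, solve(XtX, x)))
half_width = T_CRIT * math.sqrt(mse) * math.sqrt(1 + leverage)

lower, upper = center - half_width, center + half_width
print(f"{lower:.2f} <= Y(x) <= {upper:.2f}")
```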


Assignments

Question 1: The following data indicate the gain in reading speed versus the number of weeks in
the program of 10 students in a speed-reading program.

Weeks    2    3    8   11    4    5    9    7    5    7
Speed   21   42  102  130   52   57  105   85   62   90
1. Plot a scatter diagram to see if a linear relationship is indicated.

2. Find the least squares estimates of the regression coefficients.

3. Estimate the expected gain of a student who plans to take the program for 7 weeks.

Question 2: The following data set presents the heights of 12 male law school classmates whose
law school examination scores were roughly equal. It also gives their first year salaries. Each of
them went into corporate law. The height is in inches and the salary in units of $1,000.

Height 64 65 66 67 69 70 72 72 74 74 75 76
Salary 91 94 88 103 77 96 105 88 122 102 90 114

1. Do the above data establish the hypothesis that a lawyer’s salary is related to his height? Use
the 5 percent level of significance.

2. What was the null hypothesis in part 1?

Question 3: Fit a multiple linear regression equation to the following data set.

X1 X2 X3 X4 Y
1 11 16 4 275
2 10 9 3 183
3 9 4 2 140
4 8 1 1 82
5 7 2 1 97
6 6 1 -1 122
7 5 4 -2 146
8 4 9 -3 246
9 3 16 -4 359
19 2 25 -5 482
Question 4:

1. Fit a multiple linear regression equation to the following data set.

2. Test the hypothesis that β0 = 0.

3. Test the hypothesis that β3 = 0.

4. Test the hypothesis that the mean response at the input levels x1 = x2 = x3 = 1 is 8.5.


X1 X2 X3 Y
7.1 0.68 4 41.53
9.9 0.64 1 63.75
3.6 0.58 1 16.38
9.3 0.21 3 45.54
2.3 0.89 5 15.52
4.6 0.00 8 28.55
0.2 0.37 5 5.65
5.4 0.11 3 25.02
8.2 0.87 4 52.49
7.1 0.00 6 38.05
4.7 0.76 0 30.76
5.4 0.87 8 39.69
1.7 0.52 1 17.59
1.9 0.31 3 13.22
9.2 0.19 5 50.98

Question 5: The cost of producing power per kilowatt hour is a function of the load factor and
the cost of coal in cents per million Btu. The following data were obtained from 12 mills.

Load factor   Coal cost   Power cost
84 14 4.1
81 16 4.4
73 22 5.6
74 24 5.1
67 20 5.0
87 29 5.3
77 26 5.4
76 15 4.8
69 29 6.1
82 24 5.5
90 25 4.7
88 13 3.9
1. Estimate the relationship.

2. Test the hypothesis that the coefficient of the load factor is equal to 0.

3. Determine a 95 percent prediction interval for the power cost when the load factor is 85 and
the coal cost is 20.
