
BMMS2074 Statistics for Data Science

Contents:
Chapter 1. Simple Linear Regression

Chapter 2. Multiple Linear Regression

Chapter 3. Decision Analysis

Chapter 4. Introduction to Time Series

Chapter 5. Simple Descriptive Techniques

Chapter 6. Time Series Theory

Chapter 7. Estimation of Time Series Models

Chapter 8. Forecasting

Reference Books:
1. Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical Statistics with Applications (7th ed.). Thomson.
2. Black, K. (2013). Applied Business Statistics: Making Better Business Decisions (7th ed.). John Wiley.
3. Chatfield, C. (2004). The Analysis of Time Series: An Introduction (6th ed.). Chapman & Hall.


Chapter 1: Simple Linear Regression

1.1 Introduction

Suppose we are interested in estimating the average GPA of all students at TAR UC. How
would we do this? (Assume we do not have access to any student records.)

Define the population:

All TAR UC students.

Define the parameter of interest:

Let $\mu$ be the population mean of the GPA of all TAR UC students.

Take a representative sample from the population:

Suppose a random sample of 100 students is selected and the GPA of each student is recorded. Let $Y$ denote the GPA of a TAR UC student.

Calculate the sample statistic that estimates the parameter:


$\bar{y} = \dfrac{\sum_{i=1}^{100} y_i}{100}$, which is the sample mean computed to estimate $\mu$.

Make an inference about the value of the parameter using statistics:


Construct confidence intervals or perform hypothesis tests using the sample mean
and sample standard deviation computed.

The diagram below demonstrates these steps. Note that not all GPAs could be shown in the
diagram.

[Diagram: sample GPAs (e.g., 3.6, 2.4, 2.7, 2.8, 2.9, 3.9, 3.2, 3.4, ...) are drawn from the population of all GPAs; an inference about $\mu$ is then made from the sample.]

What factors may be related to GPA?

- High school GPA
- Rank in high school class
- Involvement in activities
- High school overall rating
- Etc.

Suppose we are interested in the relationship between college GPA and HS GPA and we want to use HS GPA to predict college GPA. How could we do this?


Use the same steps as above, but now with a regression model.

[Diagram: sample pairs such as (2.8, 3.6), (2.2, 2.4), (2.7, 2.6), (2.9, 2.8), (3.0, 2.9), (4.0, 3.9), ... are drawn from the population of pairs; an inference about the relationship is then made from the sample. Data shown as (HS GPA, College GPA).]

1.1.1 Scatterplot

The main objective of this chapter is to analyze a collection of paired sample data (or
bivariate data) and determine whether there appears to be a relationship between the two
variables.

A set of bivariate data is denoted as $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

A correlation exists between two variables when one of them is related to the other in some
way.

A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted
with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is plotted as a single
point.

Example
Suppose we take a sample of seven households and collect information on their incomes and
food expenditures for the past month. The information obtained (in hundreds of RM) is given
below.

Income (RM hundreds)            35   49   21   39   15   28   25
Food expenditure (RM hundreds)   9   15    7   11    5    8    9

Solution
The scatter diagram for this set of data is
[Scatter diagram: food expenditure (vertical axis, 4 to 16) plotted against income (horizontal axis, 10 to 50); the points drift upward from left to right.]
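For readers who want to reproduce the diagram, here is a minimal sketch in Python with matplotlib (code added to these notes as an illustration, not part of the original example):

```python
# Scatter diagram for the income / food-expenditure data (both in RM hundreds).
import matplotlib.pyplot as plt

income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]

plt.scatter(income, food)
plt.xlabel("Income (RM hundreds)")
plt.ylabel("Food expenditure (RM hundreds)")
plt.title("Food expenditure vs. income")
plt.show()
```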


1.2 Simple Linear Regression Model


The response or dependent variable is the response of interest and is usually denoted by Y .

The explanatory, independent or predictor variable attempts to explain the response and is
usually denoted by X .

A scatter plot shows the relationship between two quantitative variables X and Y. The values of the X variable are marked on the horizontal axis, and the values of the Y variable are marked on the vertical axis. Each pair of observations $(x_i, y_i)$ is represented as a point in the plot.

Two variables are said to be positively associated if, as X increases, the value of Y tends to
increase. Two variables are said to be negatively associated if, as X increases, the value of
Y tends to decrease.

Interpretation:
Form, Direction, Strength, Any Deviations

[Four scatter plots, Figures A-D: Figure A shows positive association; Figure B shows negative association; Figure C shows no association; Figure D shows a curved (quadratic) pattern.]

Figure A shows a moderately strong, positive, linear relationship.

Figure B shows a strong, negative, slightly curved but nearly linear relationship.

Figure C shows no association between the two variables.

Figure D shows a very strong association, though not a linear one; the pattern is quadratic (curvilinear).


Example 1.1
A random sample of TAR UC students is taken, producing the data set below.

Student   X (HS GPA)    Y (College GPA)
1         x1 = 3.04     y1 = 3.10
2         x2 = 2.35     y2 = 2.30
3         2.70          3.00
...       ...           ...
18        4.00          3.80
19        2.28          2.20
20        1.88          1.60

Scatter plot of the data:

[Scatter plot titled "College GPA vs. HS GPA": Y (College GPA) on the vertical axis (0.00 to 4.00) against X (HS GPA) on the horizontal axis (0.00 to 5.00).]

It shows a fairly strong positive linear association between College GPA and HS GPA.

A functional relation between two variables is expressed by a mathematical formula. If X denotes the independent variable and Y the dependent variable, a functional relation is of the form Y = f(X). Given a particular value of X, the function f indicates the corresponding value of Y. All values fall directly on the line of functional relationship.

A statistical relation (regression) between two variables is not a perfect fit. In general, the observations do not fall directly on the curve of relationship.

Example:
Consider the relation between the sales value Y (in RM) of a product sold at a fixed price and the number of units sold X. If the selling price is RM2 per unit, the relation is expressed by the equation Y = 2X.

Number of Units Sold, X   Sales, Y (RM)
75                        150
25                        50
130                       260

Example:
Performance evaluations for 23 employees were obtained at midyear (0-10 scale) and at year-end (0-400 points). These data are plotted in a figure (not reproduced here).

The figure clearly suggests that there is a positive linear relation between midyear and year-end evaluations. However, the relation is not a perfect fit. The scattering of the points suggests that some of the variation in year-end evaluations is not accounted for by the midyear performance assessments. For instance, two employees had a midyear evaluation of x = 4, yet they received different year-end evaluations.

1.2.1 Formal Statement of Model

Suppose you are interested in studying the relationship between two variables X and Y .
[Diagram: a sample is taken from the population; the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ computed from the sample is used to make inferences about the population line in $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.]

The population model can be stated as follows:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

where:
(i) $y_i$ is the value of the response (dependent) variable in the ith trial/observation.
(ii) The regressor $x_i$ is a known (fixed) constant, i.e. the value of the predictor (independent) variable in the ith trial.
(iii) The intercept $\beta_0$ and the slope $\beta_1$ are unknown constants (parameters).
(iv) $\varepsilon_i$ is the random error.

Assumptions
1. The error terms $\varepsilon_i$ are normally and independently distributed with $E(\varepsilon_i) = 0$ and constant variance $Var(\varepsilon_i) = \sigma^2$.
2. The errors (and thus the $y_i$ as well) are uncorrelated with each other.
   $\varepsilon_i \sim NID(0, \sigma^2)$
3. $E(Y \mid x) = \beta_0 + \beta_1 x$; $Var(Y \mid x) = \sigma^2$


Note:
The above model is said to be simple, linear in the parameters, and linear in the predictor variable.
It is "simple" in that there is only one predictor variable; "linear in the parameters" because no parameter appears as an exponent or is multiplied or divided by another parameter; and "linear in the predictor variable" because the predictor variable appears only to the first power.
A model that is linear in the parameters and in the predictor variable is also called a first-order model.

The parameters $\beta_0$ and $\beta_1$ are unknown and can be estimated using $n$ pairs of sample data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Population linear regression model:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ is the random error. Each observed value deviates from the population regression line $E(Y) = \beta_0 + \beta_1 X$ by its error $\varepsilon_i$.

Sample linear regression model:
$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i$, where $e_i$ is the residual. The fitted (estimated) regression line is $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, and each sampled value deviates from it by its residual $e_i$.


1.3 Least Squares Estimation of the Parameters

The line that minimizes the sum of squares of the deviations of the observed values $y_i$ from those predicted is the best-fitting line.

The least-squares (LS) criterion is defined as
$S(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} e_i^2 = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
which is the error sum of squares.

As discussed, the best-fitted line is the one that minimizes $S$, that is,
$\dfrac{\partial S}{\partial \hat{\beta}_0} = 0$ and $\dfrac{\partial S}{\partial \hat{\beta}_1} = 0$

The solution for the LS estimators, $\hat{\beta}_0$ and $\hat{\beta}_1$, is:
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \dfrac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{S_{XY}}{S_{XX}}$

where $S_{XX} = \sum (x - \bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$ and $S_{XY} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$.

Note that $S_{YY} = \sum (y - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$ (will be used later).

A test of the second partial derivatives will show that a minimum is obtained with the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$.

Thus, the fitted simple linear regression model (estimated regression equation or line) is
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Properties of LSE:

(a) It can be shown that
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \dfrac{\sum (x_i - \bar{x})y_i - \sum (x_i - \bar{x})\bar{y}}{\sum (x_i - \bar{x})^2} = \dfrac{\sum (x_i - \bar{x})y_i}{\sum (x_i - \bar{x})^2} = \sum k_i y_i$
where $k_i = \dfrac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$. Since the $k_i$ are known constants, $\hat{\beta}_1$ is a linear combination of the $y_i$ and hence is a linear estimator. In similar fashion, it can be shown that $\hat{\beta}_0$ is a linear estimator as well.

(b) Some interesting properties of the coefficients $k_i$:
(i) $\sum k_i = 0$
(ii) $\sum k_i x_i = 1$
(iii) $\sum k_i^2 = \dfrac{1}{\sum (x_i - \bar{x})^2} = \dfrac{1}{S_{XX}}$

(c) The LSEs are unbiased estimators of the parameters $\beta_0$ and $\beta_1$:
$E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

(d) The variances of the LSEs are
$Var(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{XX}}$
$Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}\right)$

(e) The Gauss-Markov theorem states that, under the conditions of the regression model, the least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased and have minimum variance among all unbiased linear estimators, i.e. they are BLUE (Best Linear Unbiased Estimators):
i. $\hat{\beta}_0$ is the BLUE of $\beta_0$.
ii. $\hat{\beta}_1$ is the BLUE of $\beta_1$.
iii. $c_1\hat{\beta}_0 + c_2\hat{\beta}_1$ is the BLUE of $c_1\beta_0 + c_2\beta_1$.

It can also be shown that $\hat{y}$ is the BLUE of $E(Y)$.

Properties of the Fitted Regression Model

1. The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$ is a residual:
$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$
and the sum of the residuals is always zero: $\sum_{i=1}^{n} e_i = 0$.

2. The LS regression line always passes through the centroid $(\bar{x}, \bar{y})$ of the data.

3. The sum of the observed values $y_i$ equals the sum of the fitted values $\hat{y}_i$: $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i$.

4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero: $\sum_{i=1}^{n} x_i e_i = 0$.

5. The sum of the residuals weighted by the corresponding fitted values always equals zero: $\sum_{i=1}^{n} \hat{y}_i e_i = 0$.

Example 1.2
What is the relationship between sales and advertising costs for a company?

Let X be the advertising cost (in RM100,000).
Let Y be the sales (in 10,000 units).
Assume the monthly data below and independence between monthly sales.

x   y   x^2   y^2   xy
1   1   1     1     1
2   1   4     1     2
3   2   9     4     6
4   2   16    4     8
5   4

The regression line is:

The corresponding scatterplot:

[Scatter plot titled "Sales vs. Advertising": Sales (vertical axis, 0 to 5) against Advertising (horizontal axis, 0 to 6).]
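Filling in the computation (a sketch added to these notes): applying the least-squares formulas to these five data points gives the line that is used later in Ex. 1.5.

```python
# Fitting the regression line for Ex. 1.2 by hand.
x = [1, 2, 3, 4, 5]   # advertising cost (RM100,000)
y = [1, 1, 2, 2, 4]   # sales (10,000 units)
n = len(x)
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 37 - 30 = 7
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 55 - 45 = 10
b1 = s_xy / s_xx                       # 0.7
b0 = sum(y) / n - b1 * sum(x) / n      # 2 - 0.7 * 3 = -0.1
print(f"y_hat = {b0:.1f} + {b1:.1f} x")   # y_hat = -0.1 + 0.7 x
```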


Example 1.3
(a) What do the estimated parameters in Ex. 1.2 mean?
(b) What are the estimated sales when the advertising cost is RM100,000 and
RM250,000, respectively?

Extrapolation is using the regression line to predict the value of a response corresponding to an x value that is outside the range of the data used to determine the regression line. Extrapolation can lead to unreliable predictions.

Example 1.4: Childhood Growth


The growth of children from early childhood through adolescence generally follows a linear pattern. Data on the heights of females during childhood, from four to nine years old, were compiled and the least squares regression line was obtained as $\hat{y} = 31.496 + 2.3622x$, where Y denotes height in inches and X denotes age in years.
(a) Interpret the value of the estimated slope $\hat{\beta}_1 = 2.3622$.
(b) Would interpretation of the value of the estimated Y-intercept, $\hat{\beta}_0 = 31.496$, make sense here? If yes, interpret it. If no, explain why not.
(c) What would you predict the height to be for a female at 8 years old?
(d) What would you predict the height to be for a female at 25 years old?
(e) Why do you think your answer to part (d) was so inaccurate?
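A short sketch (added to these notes) makes the extrapolation problem in parts (c) and (d) concrete:

```python
# Predictions from the fitted childhood-growth line y_hat = 31.496 + 2.3622x.
def predicted_height(age_years):
    return 31.496 + 2.3622 * age_years

print(predicted_height(8))    # about 50.4 inches: age 8 is inside the 4-9 range
print(predicted_height(25))   # about 90.6 inches (~7.5 feet): age 25 is far
                              # outside the data range, so this extrapolated
                              # prediction is not credible
```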


1.4 Estimating $\sigma^2$

Population simple linear regression model:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where $\varepsilon_i \sim NID(0, \sigma^2)$

$\sigma^2$ measures the variability of the $\varepsilon_i$.

$\sigma^2$ can be estimated based on the residual or error sum of squares, where
$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$\hat{\sigma}^2 = MS_E = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2} = \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}$

This unbiased estimator of $\sigma^2$ is called the residual mean square, and its square root is called the standard error of regression.

Note:
(a) $SS_E$ has $n-2$ degrees of freedom associated with it. Two degrees of freedom are lost due to the estimation of $\hat{\beta}_0$ and $\hat{\beta}_1$ (remember that $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$).
(b) $SS_E = S_{YY} - \hat{\beta}_1 S_{XY}$

Example 1.5: Sales and Advertising

x_i   y_i   $\hat{y}_i = -0.1 + 0.7x_i$   $e_i = y_i - \hat{y}_i$   $e_i^2$
1     1     0.6                            0.4                       0.16
2     1     1.3                           -0.3                       0.09
3     2     2.0                            0                         0
4     2     2.7                           -0.7                       0.49
5     4
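The last row of the table is left blank in the notes as an exercise; a short sketch (added here) completes it and estimates $\sigma^2$:

```python
# Residuals and MS_E for the sales/advertising data of Ex. 1.5.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
y_hat = [-0.1 + 0.7 * xi for xi in x]        # fitted values; the last one is 3.4
e = [yi - yh for yi, yh in zip(y, y_hat)]    # residuals; the last one is 0.6
ss_e = sum(ei ** 2 for ei in e)              # SS_E = 1.10
ms_e = ss_e / (len(x) - 2)                   # sigma^2_hat = MS_E ≈ 0.367
print(round(ss_e, 2), round(ms_e, 3))
```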

1.5 Correlation

The linear correlation coefficient, r (or R), also called the Pearson product moment correlation coefficient, measures the strength of the linear relationship between the paired quantitative x- and y-values in a sample. It describes the direction of the linear association and indicates how closely the points in a scatter plot lie to the least squares regression line.

1.5.1 The formula

$r = \dfrac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \dfrac{n(\sum x_i y_i) - (\sum x_i)(\sum y_i)}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$

Properties of the linear correlation coefficient r:

1. The value of r is always between -1 and 1 inclusive; that is, $-1 \le r \le 1$.
2. r measures the strength of a linear relationship. It is not designed to measure the strength of a relationship that is not linear.
3. If $r = 0$, then there is no linear relationship between the two variables.
4. $0 \le r^2 \le 1$

Degree of correlation   Positive correlation    Negative correlation
Perfect                 $+1$                    $-1$
Strong                  $0.8 \le r < 1.0$       $-1.0 < r \le -0.8$
Moderate                $0.4 \le r < 0.8$       $-0.8 < r \le -0.4$
Weak                    $0 < r < 0.4$           $-0.4 < r < 0$
Absent                  $0$                     $0$

Scatter diagrams and correlation:

[Six sketches: no relationship; positive linear correlation; perfect positive linear correlation; non-linear relationship; negative linear correlation; perfect negative linear correlation.]

Example 1.6:
[Four scatter plots, Graphs A-D, to be classified.]
Graph A: ___________   Graph B: ___________
Graph C: ___________   Graph D: ___________


Example 1.7:
Compute the correlation coefficient r for Test 1 versus Test 2.

x     y    x^2   y^2    xy
8     9    64    81     72
10    13
12    14   144   196    168
14    15   196   225    210
16    19   256   361    304
Σ 60  70         1032   884

1.5.3 Relationship between r and the slope

$\hat{\beta}_1 = r\left(\dfrac{s_Y}{s_X}\right)$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$

[Figure: the least squares regression line, with slope $\hat{\beta}_1$, passing through the point $(\bar{x}, \bar{y})$.]

Note: The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.

Example 1.8:
The scores on the midterm and final exam for 500 students were obtained. The possible
values for each exam are between 0 and 100. The least squares regression line for predicting
the final exam from the midterm exam was obtained for these data. Suppose the correlation
coefficient is 0.5 for these data, r = 0.5 .
Susan, a student in this class, received a midterm score that was one standard deviation above
the average midterm score. Suppose the average and standard deviation for the midterm
scores were 80 and 10, respectively. Also suppose that the average and standard deviation
for the final exam scores were 60 and 20, respectively.

Predict Susan's final exam score.
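One way to work this (a sketch added to these notes), using the two relations from Section 1.5.3:

```python
# Predicting Susan's final exam score from the summary statistics given.
r, x_bar, s_x, y_bar, s_y = 0.5, 80, 10, 60, 20
b1 = r * (s_y / s_x)            # 0.5 * (20 / 10) = 1.0
b0 = y_bar - b1 * x_bar         # 60 - 1.0 * 80 = -20.0
susan_midterm = x_bar + s_x     # one standard deviation above the mean: 90
print(b0 + b1 * susan_midterm)  # predicted final exam score: 70.0
```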


1.6 Assumptions of the error

From $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ where $\varepsilon_i \sim NID(0, \sigma^2)$.

This implies that

1. The $y_i$'s are independent.
2. $E(y_i) = \beta_0 + \beta_1 x_i$
3. $Var(y_i) = \sigma^2$
4. The $y_i$ follow a normal distribution.

since $x_i$, $\beta_0$ and $\beta_1$ are assumed to be constant in the regression model.

Thus, for a particular $x_i$ value:

[Figure: a normal density for $Y_i$ centred at $\beta_0 + \beta_1 x_i$.]

Note:
1) There is a probability distribution of Y for each level of X.
2) The means of these probability distributions vary in some systematic fashion with X.
3) All the probability distributions of $y_i$ exhibit the same variability, $\sigma^2$, in conformance with the assumptions of the simple regression model.

Thus, the response $Y_i$, when the level of X in the ith trial is $X_i$, comes from a probability distribution whose mean is
$E(Y_i) = \beta_0 + \beta_1 X_i$


1.7 Inferences concerning $\beta_1$ and $\beta_0$

1.7.1 The sampling distribution for $\hat{\beta}_1$

Population regression model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

From the assumption that $\varepsilon_i \sim NID(0, \sigma^2)$, we have $y_i \sim NID(\beta_0 + \beta_1 x_i, \sigma^2)$.

Since $\hat{\beta}_1 = \sum k_i y_i$, it follows that $\hat{\beta}_1 \sim NID\left(\beta_1, \dfrac{\sigma^2}{S_{XX}}\right)$.

To test the hypothesis that the slope equals a constant, we have $H_0: \beta_1 = \beta_{10}$, and the test statistic is
$z = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{XX}}} \sim N(0, 1)$

If $\sigma^2$ is unknown, it is replaced by its unbiased estimator $MS_E$ and the test statistic becomes
$t = \dfrac{\hat{\beta}_1 - \beta_{10}}{\sqrt{MS_E / S_{XX}}} \sim t(n-2)$

The term $se(\hat{\beta}_1) = \sqrt{\dfrac{MS_E}{S_{XX}}}$ is the (estimated) standard error of $\hat{\beta}_1$.

1.7.2 Special Case: Testing Significance of Regression

Test for
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Suppose $\beta_1 = 0$; then $y_i = \beta_0 + 0 \cdot x_i + \varepsilon_i = \beta_0 + \varepsilon_i$.

Example plot: Suppose $\beta_0 = 3$.

[Figure: data for $y_i = 3 + 0 \cdot x_i + \varepsilon_i$ scattered about the horizontal line $E(Y) = 3$.]

Note:
1. The hypothesis testing can be done via (i) a t-test or z-test, or (ii) analysis of variance (ANOVA).

2. Accepting the null hypothesis $H_0$ indicates that there is no linear relationship between X and Y, which means
(i) X is of little value in explaining the variation of Y, or
(ii) the true relationship between X and Y is not linear.
3. Rejecting the null hypothesis $H_0$ indicates that
(i) X is of value in explaining the variation of Y, and
(ii) the straight-line model is adequate, or better results could be obtained with the addition of higher-order polynomial terms in X.

Example 1.9:
For Ex. 1.2, is advertising linearly related to sales? Use $\alpha = 0.05$.

1.7.3 Hypothesis test on $\beta_0$

To test the hypothesis that the intercept equals a constant, we have $H_0: \beta_0 = \beta_{00}$ and the test statistic is
$t = \dfrac{\hat{\beta}_0 - \beta_{00}}{\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}\right)}} \sim t(n-2)$

The term $se(\hat{\beta}_0) = \sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}\right)}$ is the (estimated) standard error of $\hat{\beta}_0$.

Example 1.10:
Refer to Ex. 1.7, Test 1 vs Test 2.
(a) What is the regression line that relates Test 1 to Test 2?
(b) Is there sufficient evidence to conclude that a linear relationship exists between Test 1 and Test 2? Use $\alpha = 0.05$.
(c) Test whether there is a direct (positive) relationship between Test 1 and Test 2. Use $\alpha = 0.05$.
(d) The following test has no practical significance in this problem. Test whether the intercept is zero.
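A worked sketch for part (b) (added to these notes; the critical value 3.182 is $t_{0.025;3}$ from the t-table):

```python
# t-test of H0: beta1 = 0 for the Test 1 / Test 2 data.
x = [8, 10, 12, 14, 16]
y = [9, 13, 14, 15, 19]
n = len(x)
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 44
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 40
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 52
b1 = s_xy / s_xx                      # 1.1
ms_e = (s_yy - b1 * s_xy) / (n - 2)   # SS_E = S_YY - b1*S_XY = 3.6, so MS_E = 1.2
se_b1 = (ms_e / s_xx) ** 0.5          # about 0.173
t_stat = b1 / se_b1                   # about 6.35 > 3.182, so reject H0:
print(round(t_stat, 2))               # a significant linear relationship exists
```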


1.7.4 Interval Estimation in Simple Linear Regression

A $(1-\alpha)100\%$ confidence interval (C.I.) for $\beta_1$ is
$\hat{\beta}_1 - t_{\alpha/2;n-2}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2;n-2}\, se(\hat{\beta}_1)$

A $(1-\alpha)100\%$ confidence interval (C.I.) for $\beta_0$ is
$\hat{\beta}_0 - t_{\alpha/2;n-2}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2;n-2}\, se(\hat{\beta}_0)$

A $(1-\alpha)100\%$ confidence interval (C.I.) for $\sigma^2$ is
$\dfrac{(n-2)MS_E}{\chi^2_{\alpha/2;n-2}} \le \sigma^2 \le \dfrac{(n-2)MS_E}{\chi^2_{1-\alpha/2;n-2}}$

Example 1.11:
At a used car dealership, let X be an independent variable representing the age (in years) of a motorcycle and Y be the dependent variable representing the selling price of a motorcycle. Find a 95% confidence interval for $\beta_1$.

x_i   y_i    x_i^2   y_i^2    x_i y_i   $(x_i - \bar{x})^2$   $(y_i - \hat{y})^2$
5     500    25      250000   2500      38.44                 1367.52
10    400    100     160000   4000      1.44                  2923.56
12    300    144     90000    3600      0.64                  929.64
14    200    196     40000    2800      7.84                  47.75
15    100    225     10000    1500      14.44                 3011.81
Σ 56  1500   690     550000   14400     62.8                  8280.28

With 95% confidence, we estimate that the change in the mean selling price of a motorcycle when the age of the motorcycle increases by one year is a decrease of somewhere between $17.12 and $59.32.

Note:
The resulting 95% confidence interval is -59.32 to -17.12. Since the interval does not contain 0, you can conclude that the true value of $\beta_1$ is not 0, and you can reject the null hypothesis $H_0: \beta_1 = 0$ in favor of $H_1: \beta_1 \neq 0$. Furthermore, the confidence interval estimate indicates that there is a decrease of between $17.12 and $59.32 in selling price for each one-year increase in the age of the motorcycle.
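The interval can be reproduced from the column sums above (a sketch added to these notes; 3.182 is $t_{0.025;3}$ from the t-table):

```python
# 95% confidence interval for beta1 in Ex. 1.11.
n = 5
s_xx = 62.8                     # sum of (x_i - x_bar)^2 from the table
ss_e = 8280.28                  # sum of (y_i - y_hat)^2 from the table
s_xy = 14400 - 56 * 1500 / n    # sum xy - (sum x)(sum y)/n = -2400.0
b1 = s_xy / s_xx                # about -38.22
ms_e = ss_e / (n - 2)           # about 2760.09
se_b1 = (ms_e / s_xx) ** 0.5    # about 6.63
t_crit = 3.182                  # t(0.025; 3)
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(lower, 2), round(upper, 2))   # about -59.3 and -17.1, matching
                                          # the note above up to rounding
```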

1.7.5 Some considerations on making inferences concerning $\beta_1$

The sampling distribution for $\hat{\beta}_1$ holds true ONLY if the assumption $\varepsilon_i \sim NID(0, \sigma^2)$ holds; however, if the normality assumption does not hold, these sampling distributions are usually still good approximations.
$\hat{\beta}_1$ is somewhat "robust" against departures from normality.


1.8 Estimating the Mean Response

Let $x_h$ be any value of the regressor variable within the range of the original data X used to fit the model. (Note that $x_h$ may or may not be one of the values in the sample.) The mean response $E(Y \mid x_h) = \mu_{Y \mid x_h} = E(Y_h)$ can be estimated by
$\hat{E}(Y \mid x_h) = \hat{\mu}_{Y \mid x_h} = \hat{E}(Y_h) = \hat{\beta}_0 + \hat{\beta}_1 x_h$

What is the difference between $\hat{y}_h$ and $\hat{E}(Y_h)$ for a given value $x_h$?

Example 1.12:
Let X be the score for Quiz 1;
Let Y be the score for Quiz 2.
The data collected are as follows:

Quiz 1   0   2   4   6   8
Quiz 2   6   5   8   7   9

We obtained: $\hat{\beta}_0 = 5.4$; $\hat{\beta}_1 = 0.4$.

If we want to estimate the mean Quiz 2 score for all students in the population who score a 6 (i.e. $x_h = 6$) on Quiz 1, then the estimate will be
$\hat{E}(Y_h) = \hat{\beta}_0 + \hat{\beta}_1 x_h = 5.4 + 0.4(6) = 7.8$

On the other hand, we may want to predict the Quiz 2 score for an individual student who scores a 6 (i.e. $x_h = 6$) on Quiz 1; then the prediction will be
$\hat{y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h = 5.4 + 0.4(6) = 7.8$.

1.8.1 Confidence Interval for a Mean Response

Note that the variance of $\hat{E}(Y_h)$ is
$Var[\hat{E}(Y_h)] = \sigma^2\left(\dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)$

A $(1-\alpha)100\%$ C.I. on the mean response at the point $x = x_h$, $\mu_{Y \mid x_h}$, is
$\hat{E}(Y_h) - t_{\alpha/2;n-2}\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)} \le \mu_{Y \mid x_h} \le \hat{E}(Y_h) + t_{\alpha/2;n-2}\sqrt{MS_E\left(\dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)}$

Example 1.13:
Consider the data in Ex. 1.12. Construct a 95% confidence interval for the mean Quiz 2 score
for all students who scored 6 on Quiz 1.


1.9 Prediction of a new observation

If $x_h$ is the value of the regressor of interest, the point estimate of the new value of the response $y_h$ is
$\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h$

Note that the random variable
$\psi = Y_h - \hat{Y}_h \sim N\left(0,\ \sigma^2\left(1 + \dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)\right)$

1.9.1 Prediction Interval for an Individual Response

A $(1-\alpha)100\%$ prediction interval (PI) for the future observation $Y_h$ at a specified value $x = x_h$ is
$\hat{Y}_h - t_{\alpha/2;n-2}\sqrt{MS_E\left(1 + \dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)} \le Y_h \le \hat{Y}_h + t_{\alpha/2;n-2}\sqrt{MS_E\left(1 + \dfrac{1}{n} + \dfrac{(x_h - \bar{x})^2}{S_{XX}}\right)}$

Example 1.14:
Consider the data in Ex. 1.12. Compute a 95% prediction interval for an individual student who scores 6 on Quiz 1.

With 95% confidence, a student with a score of 6 on Quiz 1 should expect a Quiz 2 score between 3.83 and 11.77.
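Both intervals for the quiz data can be verified with a short sketch (added to these notes; 3.182 is $t_{0.025;3}$ from the t-table):

```python
# 95% C.I. for the mean response (Ex. 1.13) and 95% P.I. for an
# individual response (Ex. 1.14) at x_h = 6, for the quiz data.
x = [0, 2, 4, 6, 8]
y = [6, 5, 8, 7, 9]
n, x_bar = len(x), sum(x) / len(x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)                                   # 40
ms_e = sum((yi - (5.4 + 0.4 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)  # 1.2
xh, t_crit = 6, 3.182
fit = 5.4 + 0.4 * xh                                                        # 7.8
half_ci = t_crit * (ms_e * (1 / n + (xh - x_bar) ** 2 / s_xx)) ** 0.5
half_pi = t_crit * (ms_e * (1 + 1 / n + (xh - x_bar) ** 2 / s_xx)) ** 0.5
print(fit - half_ci, fit + half_ci)   # C.I.: about (5.89, 9.71)
print(fit - half_pi, fit + half_pi)   # P.I.: about (3.83, 11.77)
```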

Note:
Prediction intervals resemble confidence intervals. However, they differ conceptually:
(i) A C.I. represents an inference on a parameter and is an interval that is intended to cover the value of the parameter.
(ii) A P.I. is a statement about the value to be taken by a random variable, the new observation $Y_h$.

1.10 Analysis of Variance Approach to Regression Analysis

The total sum of squares, denoted by $SS_T$, is given by
$SS_T = \sum (y - \bar{y})^2 = S_{YY} = \sum y^2 - \dfrac{(\sum y)^2}{n}$

The regression or model sum of squares, denoted by $SS_R$, is given as
$SS_R = \sum (\hat{y}_i - \bar{y})^2$
and $SS_T = SS_R + SS_{Res}$.

Example 1.15: College and HS GPA

[Figure "College GPA vs. HS GPA": for each point, the total deviation $Y_i - \bar{Y}$ splits into the explained part $\hat{Y}_i - \bar{Y}$ and the unexplained part $Y_i - \hat{Y}_i$, measured from the fitted line $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.]

Notes:
1. $SS_T$ has $n-1$ degrees of freedom (1 is lost through the estimation of $\mu$ by $\bar{y}$).
2. $SS_R$'s degrees of freedom corresponds to the number of independent variables in the model.
3. "Mean squares" are formed by dividing the sums of squares by their corresponding degrees of freedom:
$MS_E = \dfrac{SS_E}{n-2}$, $MS_R = \dfrac{SS_R}{1}$; note that $\dfrac{SS_T}{n-1} \neq MS_R + MS_E$.
4. Analysis of variance (ANOVA) table:

Source of variation   df      SS      MS      F
Regression            1       SS_R    MS_R    F = MS_R / MS_E
Error                 n - 2   SS_E    MS_E
Total                 n - 1   SS_T

Note that F follows an F-distribution with 1 degree of freedom for the numerator and $n-2$ degrees of freedom for the denominator.

1.10.1 F-Test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$

Note:
For a given significance level $\alpha$, the F-test of $\beta_1 = 0$ vs. $\beta_1 \neq 0$ is equivalent algebraically to the two-tailed t-test.

(i) The test statistic
$F^* = \dfrac{SS_R / 1}{SS_E / (n-2)} = \dfrac{\hat{\beta}_1^2 \sum (X_i - \bar{X})^2}{MS_E} = \dfrac{\hat{\beta}_1^2}{se^2(\hat{\beta}_1)} = \left(\dfrac{\hat{\beta}_1}{se(\hat{\beta}_1)}\right)^2 = (t^*)^2$

(ii) The required percentiles of the t and F distributions for the tests satisfy
$[t(1-\alpha/2; n-2)]^2 = F(1-\alpha; 1, n-2)$. Remember that the t-test is a two-tailed test whereas the F-test is a right-tailed test.
E.g.: $[t(0.975; 23)]^2 = (2.069)^2 = 4.28 = F(0.95; 1, 23)$

The t-test is more flexible since it can be used for one-sided alternatives involving $H_1: \beta_1 < 0$ or $H_1: \beta_1 > 0$, while the F-test cannot be used for such tests.

Example 1.16:
Reconsider Ex. 1.12. Using $\alpha = 0.05$, is Quiz 1 linearly related to Quiz 2?
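A worked sketch (added to these notes; $F(0.95; 1, 3) \approx 10.13$ from the F-table):

```python
# ANOVA F-test of H0: beta1 = 0 for the quiz data of Ex. 1.12.
x = [0, 2, 4, 6, 8]
y = [6, 5, 8, 7, 9]
n = len(x)
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 16
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 40
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 10 (= SS_T)
b1 = s_xy / s_xx              # 0.4, as in Ex. 1.12
ss_r = b1 * s_xy              # SS_R = beta1_hat * S_XY = 6.4
ss_e = s_yy - ss_r            # SS_E = 3.6
f_stat = (ss_r / 1) / (ss_e / (n - 2))   # 6.4 / 1.2 ≈ 5.33
print(round(f_stat, 2))   # 5.33 < 10.13, so H0 is not rejected at alpha = 0.05
```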

1.11 Coefficient of Determination

The coefficient of determination, denoted by $R^2$, represents the proportion of $SS_T$ that is explained by the use of the linear regression model; it is defined as
$R^2 = \dfrac{SS_R}{SS_T} = 1 - \dfrac{SS_{Res}}{SS_T}$
(Recall that $SS_T$ measures the variability in Y without considering the effect of the regressor variable X.)

The computational formula for $R^2$ is
$R^2 = \dfrac{S_{XY}^2}{S_{XX} S_{YY}}$ and $0 \le R^2 \le 1$.


Notes:
1. $R^2$ measures the proportion of the variation in Y that is explained by the regressor variable X (i.e. $R^2 \times 100\%$ of the variation in Y can be "explained" by using X to predict Y); equivalently, the error in predicting Y is reduced by $R^2 \times 100\%$ when the regression model is used instead of just $\bar{y}$.
2. $R^2$ is a measure of "fit" for the regression line: values near 0 indicate a bad fit and values near 1 a good fit.
3. $r = \pm\sqrt{R^2}$ (taking the sign of the slope) is the coefficient of correlation; the square of the correlation coefficient is the coefficient of determination in simple linear regression.
4. From the relationship $\hat{\beta}_1 = r\sqrt{S_{YY}/S_{XX}}$, we obtain
$\hat{\beta}_1^2 = R^2\left(\dfrac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}\right) = \dfrac{SS_R}{SS_T}\left(\dfrac{\sum (y_i - \bar{y})^2}{\sum (x_i - \bar{x})^2}\right)$ and $SS_R = \hat{\beta}_1^2 \sum (x_i - \bar{x})^2$
5. $F = \dfrac{R^2 / 1}{(1 - R^2)/(n-2)} = \dfrac{(n-2)R^2}{1 - R^2}$

Example 1.17:
Reconsider Ex. 1.2. Find $R^2$ and give an interpretation of this quantity.
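A worked sketch (added to these notes), using the computational formula $R^2 = S_{XY}^2/(S_{XX}S_{YY})$:

```python
# R^2 for the advertising data of Ex. 1.2.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 7
s_xx = sum(a * a for a in x) - sum(x) ** 2 / n                  # 10
s_yy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 6
r2 = s_xy ** 2 / (s_xx * s_yy)   # 49 / 60
print(round(r2, 4))   # 0.8167: about 81.7% of the variation in sales is
                      # explained by the advertising cost
```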

Warning
1. Use $R^2$ as a measure of fit only when the sample size is substantially larger than the number of variables in the model; otherwise, $R^2$ may be artificially high.
For example, suppose the estimated model is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, and a random sample of size 2 is used to calculate $\hat{\beta}_0$ and $\hat{\beta}_1$. The estimated regression line then passes exactly through the two sample points, so $R^2 = 1$. In this case the sample size is not substantially larger than the number of variables in the model, causing $R^2$ to be artificially high.
2. $R^2$ only measures the linear relationship.
3. $R^2$ is a measure of how well the estimated regression line fits the sample only.

1.12 Hypothesis Testing for the Population Correlation $\rho$

A significance test can be conducted to test whether the correlation between two variables X and Y is significant or not.

The null and alternative hypotheses:

       H_0                            H_1              Type of test
(i)    $\rho = 0$                     $\rho \neq 0$    Two-tailed test
(ii)   $\rho = 0$ (or $\rho \ge 0$)   $\rho < 0$       Left-tailed test
(iii)  $\rho = 0$ (or $\rho \le 0$)   $\rho > 0$       Right-tailed test

Test statistic:
$T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}$
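For illustration (a sketch added to these notes), applying this test to the Test 1 / Test 2 data of Ex. 1.7, where $r \approx 0.9648$ and $n = 5$:

```python
# t-test of H0: rho = 0 using T = r * sqrt(n-2) / sqrt(1 - r^2).
r, n = 0.9648, 5
t_stat = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
print(round(t_stat, 2))   # about 6.35; since 6.35 > t(0.025; 3) = 3.182,
                          # reject H0 and conclude the correlation is significant
```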
