Chapter 1: Simple Linear Regression
Contents:
Chapter 1. Simple Linear Regression
Chapter 8. Forecasting
Reference Books:
1. Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical Statistics with Applications (7th ed.). Thomson.
2. Black, K. (2013). Applied Business Statistics: Making Better Business Decisions (7th ed.). John Wiley.
3. Chatfield, C. (2004). The Analysis of Time Series: An Introduction (6th ed.). Chapman & Hall.
BMMS2074 Statistics for Data Science
1.1 Introduction
Suppose we are interested in estimating the average GPA of all students at TAR UC. How
would we do this? (Assume we do not have access to any student records.)
The diagram below demonstrates these steps. Note that not all GPAs could be shown in the
diagram.
[Figure: a sample of GPAs (3.6, 2.4, 2.7, 2.8, 2.9, ...) is drawn from the population of all GPAs (3.9, 3.2, 3.4, 2.8, 2.9, 3.6, 1.2, 4.0, ...); an inference about the population is then made from the sample. Not all GPAs are shown.]
[Figure: the same idea for paired data. A sample of (X, Y) pairs, e.g. (2.8, 3.6), (2.2, 2.4), (2.7, 2.6), (2.9, 2.8), (3.0, 2.9), is drawn from the population of all (X, Y) pairs, and an inference about the population is made from the sample.]
1.1.1 Scatterplot
The main objective of this chapter is to analyze a collection of paired sample data (or
bivariate data) and determine whether there appears to be a relationship between the two
variables.
A correlation exists between two variables when one of them is related to the other in some
way.
A scatterplot (or scatter diagram) is a graph in which the paired (x, y) sample data are plotted
with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is plotted as a single
point.
Example
Suppose we take a sample of seven households and collect information on their incomes and
food expenditures for the past month. The information obtained (in hundreds of RM) is given
below.
Income (hundreds) 35 49 21 39 15 28 25
Food expenditure (hundreds) 9 15 7 11 5 8 9
Solution
The scatter diagram for this set of data is
[Scatter diagram: food expenditure (vertical axis, 4 to 16) plotted against income (horizontal axis, 10 to 50).]
The explanatory, independent or predictor variable attempts to explain the response and is
usually denoted by X .
A scatter plot shows the relationship between two quantitative variables X and Y . The
values of the X variable are marked on the horizontal axis, and the values of the Y variable
are marked on the vertical axis. Each pair of observations ( xi , y i ) is represented as a point in
the plot.
Two variables are said to be positively associated if, as X increases, the value of Y tends to
increase. Two variables are said to be negatively associated if, as X increases, the value of
Y tends to decrease.
Interpretation:
Form, Direction, Strength, Any Deviations
[Figure D: scatter plot with vertical axis from 30 to 100 and horizontal axis from 10 to 50; the points follow a curved pattern.]
Figure D: shows a very strong association, not a linear one, but rather more quadratic
(curvilinear).
Example 1.1
A random sample of UTARUC students is taken, producing the data set below.

Student   X (HS GPA)   Y (College GPA)
1         3.04         3.10
2         2.35         2.30
3         2.70         3.00
...       ...          ...
18        4.00         3.80
19        2.28         2.20
20        1.88         1.60
[Scatter plot: Y (College GPA), from 0.00 to 4.00, against X (HS GPA), from 0.00 to 5.00.]
It shows fairly strong positive linear association between College GPA and HS GPA.
A statistical relation (regression) between two variables is not a perfect fit: in general, the observations do not fall directly on the curve of relationship. By contrast, a functional relation, such as the one in the next example, is exact.
Example:
Consider the relation between dollar sales ( Y ) of a product sold at a fixed price and number
of units sold ( X ). If the selling price is RM2 per unit, the relation is expressed by the
equation:
Y = 2X
Number of Units Sold, X Sales, Y (RM)
75 150
25 50
130 260
Example:
Performance evaluations for 23 employees were obtained at midyear (0 – 10 scale) and at
year-end (0 – 400 points). These data are plotted in the following figure.
The figure clearly suggests that there is a positive linear relation between midyear and year-end evaluation. However, the relation is not a perfect fit. The scattering of the points suggests that some of the variation in year-end evaluations is not accounted for by the midyear performance assessments. For instance, two employees had a midyear evaluation of x = 4, and yet they received different year-end evaluations.
Suppose you are interested in studying the relationship between two variables X and Y .
[Figure: taking a sample from the population model yields the fitted line ŷ_i = β̂₀ + β̂₁x_i.]

The simple linear regression model is

y_i = β₀ + β₁x_i + ε_i
where:
(i) y_i is the value of the response (dependent) variable in the ith trial/observation;
(ii) the regressor x_i is a known (fixed) constant, i.e. the value of the predictor (independent) variable in the ith trial;
(iii) the intercept β₀ and the slope β₁ are unknown constants (parameters);
(iv) ε_i is the random error.
Assumptions
1. The error terms ε_i are normally and independently distributed with E(ε_i) = 0 and constant variance Var(ε_i) = σ², i.e. ε_i ~ NID(0, σ²).
2. The errors (and thus the y_i also) are uncorrelated with each other.
3. E(Y | x) = β₀ + β₁x ; Var(Y | x) = σ².
Note:
The above model is said to be simple, linear in the parameters, and linear in the predictor variable.
It is “simple” in that there is only one predictor variable, “linear in the parameters” because no parameter appears as an exponent or is multiplied or divided by another parameter, and “linear in the predictor variable” because the predictor variable appears only to the first power.
A model that is linear in the parameters and in the predictor variable is also called a
first–order model.
The parameters β₀ and β₁ are unknown and can be estimated using n pairs of sample data (x₁, y₁), (x₂, y₂), ..., (x_n, y_n).
[Figure: Population linear regression model. Observed values Y_i = β₀ + β₁X_i + ε_i scatter about the population line E(Y) = β₀ + β₁X.]

[Figure: Sample linear regression model. Sampled values Y_i = b₀ + b₁X_i + e_i scatter about the fitted line Ŷ_i = β̂₀ + β̂₁X_i; unsampled values from the population lie off the sampled points.]
The line that minimizes the sum of squares of the deviations of observed values of y i from
those predicted is the best–fitting line.
S(β̂₀, β̂₁) = Σ_{i=1}^{n} e_i² = Σ (y_i − ŷ_i)² = Σ (y_i − β̂₀ − β̂₁x_i)²

which is the error sum of squares.

As discussed, the best-fitted line is the one which minimizes S, that is,

∂S/∂β̂₀ = 0 and ∂S/∂β̂₁ = 0

Solving these equations gives

β̂₁ = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² = S_XY / S_XX

and β̂₀ = ȳ − β̂₁x̄.
A test of the second partial derivatives will show that a minimum is obtained with the least squares estimators β̂₀ and β̂₁.
Thus, the fitted simple linear regression model (estimated regression equation or line) is

ŷ = β̂₀ + β̂₁x
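The closed-form estimates above translate directly into code. Below is a minimal sketch in plain Python (illustrative, not part of the original notes), applied to the income/food-expenditure data from Section 1.1.1.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the error sum of squares
    S = sum (y_i - b0 - b1*x_i)^2, via b1 = S_XY / S_XX and
    b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    return b0, b1

income = [35, 49, 21, 39, 15, 28, 25]
food = [9, 15, 7, 11, 5, 8, 9]
b0, b1 = least_squares_fit(income, food)
print(f"yhat = {b0:.3f} + {b1:.3f} x")
```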
Properties of LSE:
1. β̂₁ is a linear combination of the observations y_i:

   β̂₁ = Σ(x_i − x̄)(y_i − ȳ)/S_XX = [Σ(x_i − x̄)y_i − ȳ Σ(x_i − x̄)]/S_XX = Σ(x_i − x̄)y_i / S_XX,

   since Σ(x_i − x̄) = 0. Hence β̂₁ is a linear estimator. In a similar fashion, it can be shown that β̂₀ is a linear estimator as well.
2. By the Gauss–Markov theorem:
   i. β̂₀ is the BLUE of β₀.
   ii. β̂₁ is the BLUE of β₁.
   iii. c₁β̂₀ + c₂β̂₁ is the BLUE of c₁β₀ + c₂β₁.
3. The sum of the observed values y_i equals the sum of the fitted values ŷ_i:

   Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} ŷ_i

4. The sum of the residuals weighted by the corresponding value of the regressor variable always equals zero:

   Σ_{i=1}^{n} x_i e_i = 0

5. The sum of the residuals weighted by the corresponding fitted values always equals zero:

   Σ_{i=1}^{n} ŷ_i e_i = 0
Example 1.2
What is the relationship between sales and advertising costs for a company?
x   y   x²   y²   xy
1   1   1    1    1
2   1   4    1    2
3   2   9    4    6
4   2   16   4    8
5   4   25   16   20

[Scatter plot: Sales (vertical axis, 0 to 4) against Advertising (horizontal axis, 0 to 6).]
Example 1.3
(a) What do the estimated parameters in Ex. 1.2 mean?
(b) What are the estimated sales when the advertising cost is RM100,000 and
RM250,000, respectively?
Extrapolation is using the regression line to predict the value of a response corresponding to an x value that is outside the range of the data used to determine the regression line. Extrapolation can lead to unreliable predictions.
1.4 Estimating σ²

Population simple linear regression model:

y_i = β₀ + β₁x_i + ε_i, where ε_i ~ NID(0, σ²)

An unbiased estimator of σ² is σ̂² = MS_E = SS_E/(n − 2), where SS_E = Σ e_i² is the error sum of squares.

Note:
(a) SS_E has n − 2 degrees of freedom associated with it. Two degrees of freedom are lost due to the estimation of β̂₀ and β̂₁ (remember that ŷ_i = β̂₀ + β̂₁x_i).
(b) SS_E = S_YY − β̂₁S_XY
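As a short illustrative sketch (not part of the original notes), the shortcut in note (b) applied to the advertising/sales data of Example 1.2:

```python
# Estimate sigma^2 via MS_E = SS_E / (n - 2), using the shortcut
# SS_E = S_YY - b1 * S_XY. Data: Example 1.2 (x = advertising, y = sales).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)                       # 10
s_yy = sum((yi - ybar) ** 2 for yi in y)                       # 6
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # 7
b1 = s_xy / s_xx                                               # 0.7
ss_e = s_yy - b1 * s_xy                                        # 6 - 0.7*7 = 1.1
ms_e = ss_e / (n - 2)                                          # estimate of sigma^2
print(round(ms_e, 4))
```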
1.5 Correlation
The linear correlation coefficient, r (or R; also called the Pearson product-moment correlation coefficient), measures the strength of the linear relationship between the paired x- and y-values in a sample. It describes the direction of the linear association and indicates how closely the points in a scatter plot lie to the least squares regression line.

r = S_XY / √(S_XX S_YY) = [nΣx_iy_i − (Σx_i)(Σy_i)] / √{[nΣx_i² − (Σx_i)²][nΣy_i² − (Σy_i)²]}
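The computational formula can be coded directly. A sketch in plain Python (illustrative only), applied to the income/food-expenditure data from Section 1.1.1:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation via the computational formula
    r = [n*sum(xy) - sum(x)*sum(y)] /
        sqrt([n*sum(x^2) - sum(x)^2] * [n*sum(y^2) - sum(y)^2])."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

# Income/food expenditure data from Section 1.1.1:
r = pearson_r([35, 49, 21, 39, 15, 28, 25], [9, 15, 7, 11, 5, 8, 9])
print(round(r, 4))  # a strong positive linear association
```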
[Figure: six scatter plots. Top row: no relationship; positive linear correlation; perfect positive linear correlation. Bottom row: non-linear relationship; negative linear correlation; perfect negative linear correlation.]
Example 1.6:
Graph A: ___________ Graph B: ___________
Graph C: ___________ Graph D: ___________
[Four scatter plots of y against x, one for each blank.]
Example 1.7:
Compute the correlation coefficient r for Test 1 versus Test 2
x y x2 y2 xy
8 9 64 81 72
10 13
[Figure: the least squares regression line, with slope b, passing through the point (x̄, ȳ).]
Note: The least squares regression line always passes through the point ( x , y ) .
Example 1.8:
The scores on the midterm and final exam for 500 students were obtained. The possible
values for each exam are between 0 and 100. The least squares regression line for predicting
the final exam from the midterm exam was obtained for these data. Suppose the correlation
coefficient is 0.5 for these data, r = 0.5 .
Susan, a student in this class, received a midterm score that was one standard deviation above
the average midterm score. Suppose the average and standard deviation for the midterm
scores were 80 and 10, respectively. Also suppose that the average and standard deviation
for the final exam scores were 60 and 20, respectively.
[Figure: at each level of X (from about 22 to 27) there is a normal probability distribution of Y; each distribution is centered on the regression line E(Y) = β₀ + β₁X, and an observed Y_i is one draw from the distribution at X_i.]
Note:
1) There is a probability distribution of Y for each level of X .
2) The means of these probability distributions vary in some systematic fashion with X .
3) All the probability distributions of y_i exhibit the same variability, σ², in conformance with the assumptions of the simple regression model.
Thus, the response Y_i, when the level of X in the ith trial is X_i, comes from a probability distribution whose mean is:

E(Y_i) = β₀ + β₁X_i
Since β̂₁ = Σk_iy_i, where k_i = (x_i − x̄)/S_XX, we have β̂₁ ~ N(β₁, σ²/S_XX).

To test the hypothesis that the slope equals a constant, we have H₀: β₁ = β₁₀ and the test statistic is

z = (β̂₁ − β₁₀) / √(σ²/S_XX) ~ N(0, 1).

If σ² is unknown, it is replaced by its unbiased estimator MS_E and the test statistic becomes

t = (β̂₁ − β₁₀) / √(MS_E/S_XX) ~ t(n − 2).
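As an illustrative sketch (not part of the notes), the t statistic for H₀: β₁ = 0 on the advertising/sales data of Example 1.2; the critical value t(0.025; 3) = 3.182 is hard-coded from tables rather than computed.

```python
from math import sqrt

# t-test of H0: beta1 = 0 for the advertising/sales data of Example 1.2.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
s_yy = sum((yi - ybar) ** 2 for yi in y)
b1 = s_xy / s_xx
ms_e = (s_yy - b1 * s_xy) / (n - 2)     # unbiased estimate of sigma^2
t_stat = b1 / sqrt(ms_e / s_xx)         # (b1 - 0) / se(b1)
t_crit = 3.182                          # t(0.025; n-2=3), from tables
print(round(t_stat, 3), abs(t_stat) > t_crit)
```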
[Figure: data simulated from Y_i = 3 + 0·X_i + ε_i, with X from 0 to 8 and Y from about 0 to 5; the points scatter about the horizontal line E(Y) = 3 + 0·X, whose intercept is β₀ = 3 and whose slope is zero.]
Note:
1. The hypothesis test can be carried out via (i) a t-test or z-test, or (ii) analysis of variance (ANOVA).
Example 1.9:
For Ex. 1.2, is advertising linearly related to sales? Use α = 0.05.
To test the hypothesis that the intercept equals a constant, we have H₀: β₀ = β₀₀ and the test statistic is

t = (β̂₀ − β₀₀) / √(MS_E(1/n + x̄²/S_XX)) ~ t(n − 2).

The term se(β̂₀) = √(MS_E(1/n + x̄²/S_XX)) is the (estimated) standard error of β̂₀.
Example 1.10:
Refer to Ex.1.7, Test 1 vs Test 2.
(a) What is the regression line that relates Test 1 to Test 2?
(b) Is there sufficient evidence to conclude that a linear relationship exists between Test 1 and Test 2? Use α = 0.05.
(c) Test whether there is a direct (positive) relationship between Test 1 and Test 2. Use α = 0.05.
(d) The following test has no practical significance in this problem. Test whether the intercept is zero.
Example 1.11:
At a used car dealership, let X be an independent variable representing the age in years of a motorcycle and Y be the dependent variable representing the selling price of a motorcycle. Find a 95% confidence interval for β₁.

x_i    y_i     x_i²   y_i²     x_iy_i   (x_i − x̄)²   (y_i − ŷ_i)²
5      500     25     250000   2500     38.44        1367.52
10     400     100    160000   4000     1.44         2923.56
12     300     144    90000    3600     0.64         929.64
14     200     196    40000    2800     7.84         47.75
15     100     225    10000    1500     14.44        3011.81
Sum: 56  1500  690    550000   14400    62.80        8280.28
With 95% confidence, we estimate that the mean selling price of a motorcycle decreases by somewhere between $17.12 and $59.32 for each one-year increase in the age of the motorcycle.
Note:
The resulting 95% confidence interval is −59.32 to −17.12. Since the interval does not contain 0, you can conclude that the true value of β₁ is not 0, and you can reject the null hypothesis H₀: β₁ = 0 in favor of H₁: β₁ ≠ 0. Furthermore, the confidence interval estimate indicates that there is a decrease of between $17.12 and $59.32 in selling price for each one-year increase in the age of the motorcycle.
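The interval above can be reproduced with a few lines of code. A sketch (illustrative only) on the motorcycle data, with the critical value t(0.025; 3) = 3.182 taken from tables:

```python
from math import sqrt

# 95% CI for the slope: b1 +/- t(0.025; n-2) * se(b1),
# using the motorcycle age/price data of Example 1.11.
age = [5, 10, 12, 14, 15]
price = [500, 400, 300, 200, 100]
n = len(age)
xbar, ybar = sum(age) / n, sum(price) / n
s_xx = sum((x - xbar) ** 2 for x in age)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(age, price))
s_yy = sum((y - ybar) ** 2 for y in price)
b1 = s_xy / s_xx
ms_e = (s_yy - b1 * s_xy) / (n - 2)
se_b1 = sqrt(ms_e / s_xx)
t_crit = 3.182                       # t(0.025; 3), from tables
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(lower, 2), round(upper, 2))  # roughly (-59.3, -17.1)
```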
Let x_h be any value of the regressor variable within the range of the original data X used to fit the model. (Note that x_h may or may not be one of the values in the sample.) The mean response E(Y | x_h) = μ_{Y|x_h} = E(Y_h) can be estimated by

Ê(Y | x_h) = μ̂_{Y|x_h} = Ê(Y_h) = β̂₀ + β̂₁x_h

What is the difference between ŷ_h and Ê(Y_h) for a given value x_h?
Example 1.12:
Let X be the score for Quiz 1;
Let Y be the score for Quiz 2;
The data collected are as follows:
Quiz 1 0 2 4 6 8
Quiz 2 6 5 8 7 9
We obtained:

β̂₀ = 5.4; β̂₁ = 0.4

If we want to estimate the mean Quiz 2 score for all students in the population who score a 6 (i.e. x_h = 6) on Quiz 1, then the estimate will be

Ê(Y_h) = β̂₀ + β̂₁x_h = 5.4 + 0.4(6) = 7.8

On the other hand, we may want to predict the Quiz 2 score for an individual student who scores a 6 (i.e. x_h = 6) on Quiz 1, then the prediction will be

ŷ_h = β̂₀ + β̂₁x_h = 5.4 + 0.4(6) = 7.8.
Ê(Y_h) − t_{α/2; n−2}√(MS_E(1/n + (x_h − x̄)²/S_XX)) ≤ μ_{Y|x_h} ≤ Ê(Y_h) + t_{α/2; n−2}√(MS_E(1/n + (x_h − x̄)²/S_XX))
Example 1.13:
Consider the data in Ex. 1.12. Construct a 95% confidence interval for the mean Quiz 2 score
for all students who scored 6 on Quiz 1.
Example 1.14:
Consider the data in Ex. 1.12. Compute a 95% prediction interval for an individual student
who scores 6 on Quiz 1.
With 95% confidence, a student with a score of 6 on Quiz 1 should expect a Quiz 2 score between 3.83 and 11.77.
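Both intervals for the quiz data can be checked in code. A sketch (illustrative only); the prediction-interval standard error adds 1 inside the square root to account for the variance of the new observation itself, a standard formula assumed here since the notes state only the interval's endpoints.

```python
from math import sqrt

# 95% CI for the mean response and 95% PI for a new observation
# at x_h = 6, using the quiz data of Example 1.12.
x = [0, 2, 4, 6, 8]
y = [6, 5, 8, 7, 9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
s_yy = sum((yi - ybar) ** 2 for yi in y)
b1 = s_xy / s_xx                     # 0.4
b0 = ybar - b1 * xbar                # 5.4
ms_e = (s_yy - b1 * s_xy) / (n - 2)  # 1.2
x_h = 6
y_h = b0 + b1 * x_h                  # 7.8
t_crit = 3.182                       # t(0.025; 3), from tables
se_mean = sqrt(ms_e * (1 / n + (x_h - xbar) ** 2 / s_xx))
se_pred = sqrt(ms_e * (1 + 1 / n + (x_h - xbar) ** 2 / s_xx))  # extra "1"
ci = (y_h - t_crit * se_mean, y_h + t_crit * se_mean)
pi = (y_h - t_crit * se_pred, y_h + t_crit * se_pred)
print([round(v, 2) for v in ci + pi])
```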
Note:
Prediction intervals resemble confidence intervals. However, they differ conceptually:
(i) A CI represents an inference on a parameter and is an interval that is intended to cover
the value of the parameter.
(ii) A PI is a statement about the value to be taken by a random variable, the new
observation Yh .
[Figure: College GPA vs. HS GPA. For a typical point, the total deviation Y_i − Ȳ is decomposed into the explained deviation Ŷ_i − Ȳ and the residual Y_i − Ŷ_i about the fitted line ŷ_i = β̂₀ + β̂₁x_i.]
Notes:
1. SST has n − 1 degrees of freedom (1 is lost through the estimation of the mean by ȳ).
2. SS_R's degrees of freedom correspond to the number of independent variables in the model.
3. "Mean squares" are formed by dividing the sums of squares by their corresponding degrees of freedom:

   MS_E = SS_E/(n − 2),  MS_R = SS_R/1,  but MS_R + MS_E ≠ SSTO/(n − 1).
4. Analysis of variance (ANOVA) table

Source of variation   df      SS     MS     F
Regression            1       SS_R   MS_R   F = MS_R/MS_E
Error                 n − 2   SS_E   MS_E
Total                 n − 1   SST

Note that F follows an F-distribution with 1 degree of freedom for the numerator and n − 2 degrees of freedom for the denominator.
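A sketch (illustrative only) of the ANOVA quantities for the quiz data of Example 1.12, which also checks the F = t² identity numerically:

```python
from math import sqrt

# Regression ANOVA for the quiz data of Example 1.12.
x = [0, 2, 4, 6, 8]
y = [6, 5, 8, 7, 9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)   # total SS, n-1 df
b1 = s_xy / s_xx
ss_r = b1 ** 2 * s_xx                     # regression SS, 1 df
ss_e = sst - ss_r                         # error SS, n-2 df
ms_r, ms_e = ss_r / 1, ss_e / (n - 2)
f_stat = ms_r / ms_e
t_stat = b1 / sqrt(ms_e / s_xx)
print(round(f_stat, 3), round(t_stat ** 2, 3))  # the two agree
```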
Note:
For a given significance level α, the F-test of β₁ = 0 vs. β₁ ≠ 0 is equivalent algebraically to the two-tailed t-test.

(i) The test statistic

F* = [SS_R/1] / [SS_E/(n − 2)] = β̂₁²Σ(X_i − X̄)²/MS_E = β̂₁²/se²(β̂₁) = (t*)²

(ii) The required percentiles of the t and F distributions for the tests satisfy [t(1 − α/2; n − 2)]² = F(1 − α; 1, n − 2). Remember that the t-test is a two-tailed test whereas the F-test is a right-tailed test.
E.g.: [t(0.975; 23)]² = (2.069)² = 4.28 = F(0.95; 1, 23)

The t-test is more flexible since it can be used for one-sided alternatives (H₁: β₁ > 0 or H₁: β₁ < 0), while the F-test cannot handle such tests.
Example 1.16:
Reconsider Ex. 1.12. Using α = 0.05, is Quiz 1 linearly related to Quiz 2?
Notes:
1. R² measures the proportion of variation in Y that is explained by the regressor variable X (i.e. R² × 100% of the variation in Y can be "explained" by using X to predict Y); equivalently, the error in predicting Y can be reduced by R² × 100% when the regression model is used instead of just ȳ.
2. R² is a measure of "fit" for the regression line:

   0        0.25        0.5        0.75        1.0
   Bad fit                                 Good fit

3. r = ±√R² is the coefficient of correlation. The square of the correlation coefficient is the coefficient of determination in simple linear regression.
4. From the relationship β̂₁ = r√(S_YY/S_XX), we obtain

   β̂₁² Σ(x_i − x̄)² / Σ(y_i − ȳ)² = R² = SS_R/SST  and  SS_R = β̂₁² Σ(x_i − x̄)²

5. F = (R²/1) / [(1 − R²)/(n − 2)] = (n − 2)R² / (1 − R²)
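As an illustrative check (this is the computation Example 1.17 below asks for, so treat it as a sketch rather than the worked answer), R² for the advertising/sales data of Example 1.2, together with the F identity from note 5:

```python
# R^2 = SS_R / SST for the advertising/sales data of Example 1.2,
# plus the identity F = (n-2) * R^2 / (1 - R^2).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sst = sum((yi - ybar) ** 2 for yi in y)
b1 = s_xy / s_xx
r_sq = b1 ** 2 * s_xx / sst              # = SS_R / SST
f_stat = (n - 2) * r_sq / (1 - r_sq)
print(round(r_sq, 4), round(f_stat, 3))
```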
Example 1.17:
Reconsider Ex. 1.2. Find R² and give an interpretation of this quantity.
Warning
1. Use R² as a measure of fit when the sample size is substantially larger than the number of variables in the model; otherwise, R² may be artificially high.
For example:
Suppose the estimated model is ŷ = β̂₀ + β̂₁x, and a random sample of size 2 is used to calculate β̂₀ and β̂₁. Then a scatter plot with the estimated regression line plotted upon it would look something like:

[Figure: two data points with the fitted line Ŷ = β̂₀ + β̂₁x passing exactly through both.]

and R² = 1. In this case the sample size is not substantially larger than the number of variables in the model, causing R² to be artificially high.
2. R² measures only the linear relationship.
3. R² is a measure of how well the estimated regression line fits the sample only.
A significance test can be conducted to test whether the correlation ρ between two variables X and Y is significant or not.

        H₀                 H₁        Type of test
(i)     ρ = 0              ρ ≠ 0     Two-tailed test
(ii)    ρ = 0 (or ρ ≥ 0)   ρ < 0     Left-tailed test
(iii)   ρ = 0 (or ρ ≤ 0)   ρ > 0     Right-tailed test

Test statistic:

T = r√(n − 2) / √(1 − r²) ~ t_{n−2}
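A sketch of this test (illustrative only) on the income/food-expenditure data from Section 1.1.1; the critical value t(0.025; 5) = 2.571 is hard-coded from tables.

```python
from math import sqrt

# Test H0: rho = 0 with T = r * sqrt(n-2) / sqrt(1 - r^2),
# using the income/food expenditure data from Section 1.1.1.
x = [35, 49, 21, 39, 15, 28, 25]
y = [9, 15, 7, 11, 5, 8, 9]
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
t_stat = r * sqrt(n - 2) / sqrt(1 - r * r)
t_crit = 2.571                     # t(0.025; n-2=5), from tables
print(round(t_stat, 3), abs(t_stat) > t_crit)
```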