Statistical Analysis: Linear Regression
Dr. Dafter Khembo
Learning Objectives
Correlation & Regression
In the previous lesson, we studied the linear correlation between two
variables.
Correlation deals with the relationship between two quantitative variables
(measured on the same person).
It discusses:
1) the direction of the relationship (+ve or -ve)
2) the strength of the relationship (r: from –1 to +1)
3) in hypothesis testing, whether |r| > critical value
Correlation vs. Regression
Regression Analysis
• Regression analysis means estimating, forecasting or
predicting the unknown value of one variable from the
known value(s) of the other variable(s).
• It involves:
1. Checking for a significant linear correlation (r) between
x and y
2. If there is NOT a significant linear correlation between x
and y, then we CANNOT USE REGRESSION to predict y.
3. IF THERE IS a significant linear correlation between x
and y, the best predicted y-value is found by putting
the x-value into the regression equation and calculating
y
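The screening step above can be sketched in Python. This is a minimal sketch, not the lecturer's own procedure: the data are hypothetical, `pearson_r` and `can_use_regression` are illustrative names, and the critical value 0.878 (n = 5, α = 0.05, two-tailed) is assumed to be read from a standard table of critical values for r.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def can_use_regression(r, critical_value):
    """Step 1 of the procedure: is |r| larger than the critical value?"""
    return abs(r) > critical_value


# Hypothetical paired data (not taken from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
# Assumed table value for n = 5, alpha = 0.05 (two-tailed).
significant = can_use_regression(r, critical_value=0.878)
print(f"r = {r:.4f}, significant linear correlation: {significant}")
```

For this made-up sample |r| falls below the critical value, so under the rule above the regression equation would not be used for prediction.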
The Independent and Dependent Variables
• x is the independent variable (predictor variable). It is the variable under
the investigator’s control.
• y-hat is the dependent variable (response variable). It is the variable
which the investigator is trying to estimate or predict.
• Note the different formats that are used:
$y = a + bx$
$\hat{y} = b_0 + b_1 x$
where
  $\hat{y}$ (y-hat) is the “dependent” or “response” variable because it depends on, or responds to, the value of x
  $b_0$ is the y-intercept, or the value at which the regression line crosses the vertical axis
  $b_1$ is the slope of the regression line, or the amount of change in y for every 1 unit change in x
  $x$ is the “independent” or “predictor” variable because it acts independently to predict the value of y-hat
Regression Line
• To obtain a straight line relationship, consider the sample paired
data on sales of each of the n = 5 months of a year and the
advertising expenditure incurred in each month
Scatter Diagram
Fitting a Regression Line
If all the data points lay exactly on a straight line, it would be simple
to draw that line on the scatter plot.
This is not the case with real data.
If the points on the scatter diagram can best be described by
a straight line, the next step is to fit a straight line on the
scatter diagram.
On the whole, the line must lie as close as possible to every
data point on the scatter diagram.
Since a straight line so fitted best approximates all the points
on the scatter diagram, it is better known as the line of best
fit.
A line of best fit can be fitted by means of:
1. Free hand drawing method, and
2. Least squares method
Fitting a Regression Line: Freehand Drawing
• After a careful inspection of the spread of various data
points on the scatter diagram, a straight line can be
drawn through the points such that on the whole it is
closest to every point.
• The major drawback is that the slope of the line so
drawn will vary from person to person because of the
influence of subjectivity.
• Consequently, the values of the dependent variable
estimated on the basis of such a line may not be as
accurate and precise as those based on the line of best
fit.
Fitting a Regression Line: Least Squares Method
• The line of best fit has all the data points as close to it as possible
• The least squares method of fitting a line of best fit requires
minimizing the sum of the squares of the vertical deviations of each
observed y-value from the fitted line
• It derives an equation for the line that best models the relationship
between the two variables.
• The equation has the mathematical form: y = a + bx where, y is
the value of the dependent variable, x is the value of the
independent variable, a is the intercept of the regression line on
the y axis when x = 0, and b is the slope of the regression line.
• Since a straight line is completely defined by its intercept a and
slope b, the task of fitting the same reduces only to the
computation of the values of these two constants.
• Once these two values are known, the computed y values against
each value of x can be easily obtained by substituting x values in
the linear equation
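As a sketch of where the two constants come from (a standard calculus argument that the slides do not spell out): write the sum of squared vertical deviations as a function S(a, b), set both partial derivatives to zero, and solve the resulting pair of equations for a and b.

```latex
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

\frac{\partial S}{\partial a} = -2 \sum \left( y_i - a - b x_i \right) = 0
  \;\Longrightarrow\; \sum y = n a + b \sum x

\frac{\partial S}{\partial b} = -2 \sum x_i \left( y_i - a - b x_i \right) = 0
  \;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2

b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2},
\qquad
a = \bar{y} - b \bar{x}
```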
Scatter Diagram
Least Squares Line
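[Figure: scatter diagram with the least squares line y = a + bx; a is the y-intercept and e_i is the vertical deviation of the i-th point from the line]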
Modelling a Straight Line
[Figure: the straight line y = a + bx; a is the y-intercept and b is the change in y for each 1 unit increase in x]
• By using the least squares method (a procedure that minimizes the sum of the
squared vertical deviations of the plotted points from a straight line) we are
able to construct a best-fitting straight line through the scatter diagram points
and then formulate a regression equation in the form:
$y = a + bx$
$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2}$
$a = \bar{y} - b\bar{x}$
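The slope and intercept formulas translate directly into code. The following is a minimal sketch under stated assumptions: `fit_line` is an illustrative name, and the five (expenditure, sales) pairs are hypothetical values chosen for easy arithmetic, not the data from the lecture's table.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = y_bar - b * x_bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Hypothetical monthly data in K1000 (n = 5), for illustration only.
expenditure = [10, 20, 30, 40, 50]  # x
sales = [25, 40, 50, 65, 80]        # y

a, b = fit_line(expenditure, sales)
print(f"y = {a:.2f} + {b:.2f}x")
```

With these made-up figures the sums are Σx = 150, Σy = 260, Σxy = 9150 and Σx² = 5500, giving b = 6750 / 5000 = 1.35 and a = 52 − 1.35 × 30 = 11.5.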
Month    Expenditure (K1000) (x)    Sales (K1000) (y)    xy    x²    y²
Generating The Least Squares Equation
Uses of Regression Lines
The least squares regression line may be used to
estimate a value of the dependent variable given a value
of the independent variable
The value of the independent variable (x) should be within
the range of the given data
The predicted value of the dependent variable (y) is only
an estimate
Even if the regression line fits the data well, this does
not prove that the same relationship holds between the variables
outside the range of values in the given experiment
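One way to respect the range restriction in practice is to refuse to predict outside the observed x-values. This is a minimal sketch: the function name `predict_within_range`, the coefficients and the range limits are all hypothetical.

```python
def predict_within_range(a, b, x, x_min, x_max):
    """Return the estimate a + b*x, but only for x inside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the observed range [{x_min}, {x_max}]; "
            "the fitted line should not be used to extrapolate"
        )
    return a + b * x


# Hypothetical fitted line y = 11.5 + 1.35x from data with x between 10 and 50.
estimate = predict_within_range(11.5, 1.35, 30, x_min=10, x_max=50)
print(f"estimated y at x = 30: {estimate:.2f}")

try:
    predict_within_range(11.5, 1.35, 80, x_min=10, x_max=50)
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True
```

Raising an error is a deliberately strict choice; a warning would also work where occasional extrapolation is acceptable.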
Assumptions
1. We are investigating only linear relationships
2. For each x value, y is a random variable
having a normal (bell-shaped) distribution. All
of these y distributions have the same
variance
Results are not seriously affected if
departures from normal distributions and
equal variances are not too extreme
Practical Example (from Medical Field)
Data from a study of foetal development
Date of conception (and hence age) of the foetus is
known accurately
Height of the foetus (excluding the legs) is known from
ultrasound scan
Age and length of the foetus are clearly related
Aim is to model the length and age data and use this to
assess whether a foetus of known age is growing at an
appropriate rate
Growth of a Foetus
Graphical Assessment of Data
Linear Regression Model
From the scatterplot, it would appear that age and
length are strongly related, possibly in a linear way
A straight line can be expressed mathematically in
the form
$y = a + bx$
Where b is the slope, or gradient of the line, and a is
the intercept of the line with the y-axis
Fitted Line
Interpretation of Results
The regression equation is:
$\text{length} = -2.66 + 0.12 \times \text{age}$
This implies that as the age of the foetus increases by one day,
the length increases by 0.12 cm.
For a foetus of age 85 days, the estimated length would be
$\text{length} = -2.66 + (0.12 \times 85) \approx 7.51$ cm
(The rounded coefficients shown give 7.54 cm; the quoted 7.51 cm presumably
reflects the unrounded coefficients.)
A prediction interval gives the range of values between which the
value for an individual is likely to lie: 7.01 to 8.08 cm.
Outside this range, a foetus of known age is probably not
growing at an appropriate rate.
If the measured length is < 7.01 cm, there is evidence that the foetus is
not growing as it should.
If the measured length is > 8.08 cm, is the foetus larger than expected?
Is the actual age (and due date) wrong?
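The interpretation above can be sketched as a small check. This assumes the intercept is negative (the sign is lost in the slide text, but a negative intercept is needed to reproduce a value near 7.5 cm at 85 days) and uses the rounded coefficients and the 85-day prediction interval quoted on the slide. The function names are illustrative.

```python
# Rounded coefficients as quoted on the slide; age in days, length in cm.
INTERCEPT = -2.66  # assumed negative, see the note above
SLOPE = 0.12


def predicted_length(age_days):
    """Estimated length (cm) from the fitted line."""
    return INTERCEPT + SLOPE * age_days


def assess_growth(measured_cm, lower_cm, upper_cm):
    """Compare a measured length with a prediction interval."""
    if measured_cm < lower_cm:
        return "below interval: possible evidence of slow growth"
    if measured_cm > upper_cm:
        return "above interval: larger than expected, or age may be wrong"
    return "within interval"


# Prediction interval quoted for a foetus aged 85 days.
LOWER_85, UPPER_85 = 7.01, 8.08

print(f"predicted length at 85 days: {predicted_length(85):.2f} cm")
print(assess_growth(6.80, LOWER_85, UPPER_85))
print(assess_growth(7.60, LOWER_85, UPPER_85))
```

The interval limits apply only to 85 days; a different age would need its own prediction interval from the fitted model.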
Exercise
Use the calculated least squares linear regression line
to estimate the size of a foetus at the following
gestation times:
(a) 2 days
(b) 60 days
(c) 100 days
(d) 300 days
Exercise
A sample of 6 persons was selected. Their ages (x variable) and
weights (y variable) are presented in the following table.
Find the regression equation. What is the predicted weight
when age is (i) 8.5 years and (ii) 7.5 years?
Answer
Answer
$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2}$
$\hat{y}(x) = 4.693 + 0.923x$
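Substituting the two requested ages into this equation gives the predicted weights. The sketch below assumes the operator lost between the two constants is a plus sign (a negative intercept would give implausibly small weights); `predict_weight` is an illustrative name, and the weight unit is whatever the original table used.

```python
def predict_weight(age_years):
    """Predicted weight from the fitted line y-hat = 4.693 + 0.923x."""
    return 4.693 + 0.923 * age_years


for age in (8.5, 7.5):
    print(f"age {age} years -> predicted weight {predict_weight(age):.2f}")
```

The arithmetic is 4.693 + 0.923 × 8.5 = 12.5385 and 4.693 + 0.923 × 7.5 = 11.6155, or about 12.54 and 11.62.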
Regression Line
Example:
The following data represents the number of hours 12 different
students watched television during the weekend and the scores of
each student who took a test the following Monday.
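[Figure: scatter plot with the fitted regression line; for a given x-value, d_i (for example d_3) is the vertical distance between the observed y-value and the predicted y-value on the line]
For each data point, d_i represents the difference between the observed
y-value and the predicted y-value for a given x-value on the line.
These differences are called residuals.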
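Residuals are straightforward to compute once a line is available. This is a minimal sketch with hypothetical data (not the 12 students from the example): the line y = 2.2 + 0.6x is the least squares line for these five made-up points, and `residuals` is an illustrative name.

```python
def residuals(xs, ys, a, b):
    """d_i = observed y_i minus predicted y_i on the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data and its least squares line y = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
d = residuals(xs, ys, a=2.2, b=0.6)

# Sum of squared residuals: the quantity the least squares method minimizes.
sum_sq = sum(di ** 2 for di in d)

print([round(di, 2) for di in d])
print(f"sum of residuals = {sum(d):.2f}, sum of squares = {sum_sq:.2f}")
```

For a least squares line with an intercept the residuals sum to zero (up to rounding), which is why the method works with their squares instead.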
Multiple Regression Equation
Multiple regression analysis is a straightforward extension of
simple regression analysis which allows more than one
independent variable.
This is because, in some instances, a better prediction can be found
for a dependent (response) variable by using more than one
independent (explanatory) variable.
For example, a more accurate prediction of Monday’s test grade
from the previous slide might be made by considering the number
of other classes a student is taking as well as the student’s previous
knowledge of the test material.
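For reference, the usual general form of a multiple regression equation with k independent variables is shown below. This is the standard textbook form, reusing the b_0, b_1 notation from the simple case; the extracted slide text does not state it, so read it as an assumed illustration.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
```

Here b_0 is the y-intercept and each b_j is the change in the predicted y for a 1 unit change in x_j when the other independent variables are held fixed.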
Multiple Regression
Coefficient of Determination
The coefficient of determination is the portion of the total
variation in the dependent variable that is explained by
variation in the independent variable
$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$
Example:
The correlation coefficient for the data that represents the number
of hours students watched television and the test scores of each
student is r ≈ −0.831. Find the coefficient of determination.
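The calculation is a single squaring step, sketched below. The sign of r is assumed to be negative (more television, lower scores); it is lost in the slide text and does not affect r².

```python
r = -0.831  # correlation coefficient from the example; sign assumed

r_squared = r ** 2  # coefficient of determination
print(f"r^2 = {r_squared:.3f}")
print(f"about {r_squared:.1%} of the variation in test scores is explained")
```

So r² = 0.690561 ≈ 0.691: roughly 69.1% of the variation in test scores is explained by the variation in hours of television watched, and the remaining 30.9% or so is unexplained.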