Categorical Slide2024

Categorical

Uploaded by

mereninnas

Simple Linear Regression

• Simple linear regression analysis involves only two variables, say X and Y.

• The data for such relationships are provided as pairs of observations of X and Y: (X1, Y1), (X2, Y2), …, (Xn, Yn).

• Suppose that we are interested in two quantitative variables of a population, say X and Y, such that one of them influences the other.

• The mathematical relationship between X and Y is described by:

Y = a + βX

• where a is a constant and β is a non-zero real number.

• The two quantities, a and β, are called regression coefficients.
The Method of Least Squares
► The values ‘a’ and ‘b’ in the fitted equation Ŷ = a + bX are constants, i.e., their values are fixed.

► The constant ‘a’ indicates the value of y when x = 0. It is also called the y-intercept.

► The value of ‘b’ shows the slope of the regression line and gives us a measure of the change in y for a unit change in x.

► This slope (b) is frequently termed the regression coefficient of Y on X.

► If we know the values of ‘a’ and ‘b’, we can easily compute the value of Ŷ for any given value of X.
OLS estimates of coefficients
Based on least squares estimation, the coefficients of
the estimated regression line ŷ = a + bx are given by:

\[
b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
  = \frac{\sum_{i=1}^{n}x_i y_i-\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)/n}{\sum_{i=1}^{n}x_i^2-\left(\sum_{i=1}^{n}x_i\right)^2/n}
\]

\[
a = \bar{y} - b\bar{x}
\]
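The computational (sums-based) form of the slope and intercept can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `ols_fit` and the four test points are assumptions chosen so that the exact line y = 1 + 2x is recovered.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Denominator of b: sum of x^2 minus (sum of x)^2 / n
    sxx = sum_x2 - sum_x ** 2 / n
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Numerator of b: sum of x*y minus (sum of x)(sum of y) / n
    b = (sum_xy - sum_x * sum_y / n) / sxx
    # Intercept: mean of y minus b times mean of x
    a = sum_y / n - b * sum_x / n
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x
a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(f"a = {a:.2f}, b = {b:.2f}")
```

Because the points lie exactly on a line, the fit returns the intercept and slope of that line; with real data the same function returns the line that minimises the sum of squared vertical deviations.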

Advanced Biostatistics by: Yasin


SLR-example
Heights of 10 fathers (X) together with their oldest sons (Y)
are given below (in inches). Find the regression of Y on X.

         Father (X)   Oldest son (Y)   Product (XY)   X²
         63           65               4095           3969
         64           67               4288           4096
         70           69               4830           4900
         72           70               5040           5184
         65           64               4160           4225
         67           68               4556           4489
         68           71               4828           4624
         66           63               4158           4356
         70           70               4900           4900
         71           72               5112           5041

Total    676          679              45967          45784


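SLR-example
\[
a = \bar{Y} - b\bar{X}
\]

\[
b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}
  = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2}
\]

\[
b = \frac{10(45967) - (676)(679)}{10(45784) - (676)^2}
  = \frac{459670 - 459004}{457840 - 456976}
  = \frac{666}{864} = 0.77
\]

\[
a = \frac{679}{10} - 0.77\left(\frac{676}{10}\right) = 67.9 - 52.05 = 15.85
\]

Therefore, Ŷ = 15.85 + 0.77 X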


The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y due to a unit change in X.
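The arithmetic of this example can be checked with a short Python sketch. The data are the ten father and son heights from the table; the variable names are illustrative choices, not from the slides.

```python
# Heights in inches, copied from the example table
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]

n = len(fathers)
sum_x = sum(fathers)                                   # 676
sum_y = sum(sons)                                      # 679
sum_xy = sum(x * y for x, y in zip(fathers, sons))     # 45967
sum_x2 = sum(x * x for x in fathers)                   # 45784

# Slope: (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2) = 666 / 864
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: mean of Y minus b times mean of X
a = sum_y / n - b * sum_x / n

print(f"b = {b:.4f}")
print(f"a = {a:.4f}")
```

Carrying b at full precision (666/864 ≈ 0.7708) gives an intercept of about 15.79. The value 15.85 in the worked example results from rounding b to 0.77 before computing a, so small differences of this size are a rounding effect rather than an error in the method.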
SLR-example
Estimate the height of the oldest son for a father’s height of
70 inches.

Ŷ = 15.85 + 0.77 (70) = 69.75 inches

NB: 1) n is the number of pairs of X and Y scores which are used in determining the regression line. In the above example, n = 10.

2) Be careful to distinguish between (ΣX)² and ΣX².
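Both points can be illustrated with a small Python sketch. The helper name `predict` is an assumption for illustration; the coefficients and the father heights are taken from the example.

```python
def predict(a, b, x):
    """Predicted value of Y on the fitted line Y-hat = a + b*x."""
    return a + b * x


# Predicted height of the oldest son for a 70-inch father
y_hat = predict(15.85, 0.77, 70)
print(f"Predicted height: {y_hat:.2f} inches")

# (Sum of X)^2 versus Sum of X^2 for the father heights
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
square_of_sum = sum(fathers) ** 2             # (ΣX)²: add first, then square
sum_of_squares = sum(x * x for x in fathers)  # ΣX²: square first, then add
print(square_of_sum, sum_of_squares)
```

The two quantities differ by a factor of roughly ten here (456976 against 45784), so swapping them in the slope formula would give a badly wrong result.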


Assumptions
The assumptions made when using this method are:

♣ The relationship between the outcome and the explanatory variable is linear or at least approximately linear;

♣ At each value of the explanatory variable the outcomes follow a normal distribution;

♣ The variance of the outcome is constant for all values of the explanatory variable.
Assumptions of linear regression

[Figure: illustration of the three assumptions]
Assumption 1: Linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: Same variance at each value of x
Checking Assumptions:
Assumption 1: linear relationship
Plot y against x to check for linearity

Checking Assumptions:
Assumption 2: Normality

[Figure: histogram of residuals (dependent variable: BMI) and normal P-P plot of standardized residuals (x-axis: Observed Cum Prob, y-axis: Expected Cum Prob)]
Checking Assumptions:
Assumption 3: Spread of y values constant over range of x values
(plot of residuals against X)
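The residuals used in these checks are the differences between observed and fitted values. As a minimal sketch, the code below computes them for the father and son heights from the earlier example and lists them in order of X; a plotting library could be used to draw the same pairs, but none is assumed here.

```python
# Heights in inches from the earlier SLR example
fathers = [63, 64, 70, 72, 65, 67, 68, 66, 70, 71]
sons = [65, 67, 69, 70, 64, 68, 71, 63, 70, 72]

n = len(fathers)
mean_x = sum(fathers) / n
mean_y = sum(sons) / n

# Least-squares slope and intercept (deviation form)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fathers, sons))
sxx = sum((x - mean_x) ** 2 for x in fathers)
b = sxy / sxx
a = mean_y - b * mean_x

# Residual = observed Y minus fitted Y
residuals = [y - (a + b * x) for x, y in zip(fathers, sons)]

# Listing residuals by increasing X is a text stand-in for the residual plot
for x, e in sorted(zip(fathers, residuals)):
    print(f"X = {x:2d}   residual = {e:+.2f}")
```

For a line fitted with an intercept the residuals always sum to zero, so that alone says nothing about the assumptions. What matters is whether their spread looks roughly the same across the range of X and whether they show any curved pattern.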

Exercise
Suppose we have the following dataset with the weight
and height of seven individuals.

Let weight be the predictor variable and height be the response variable. Then
a. Fit the regression equation
b. Interpret the result
c. Predict the weight of an individual with a height of 82 inches
Layout

• Confidence interval estimation of β0 and β1

• Hypothesis testing

• Correlation coefficient

• Coefficient of determination
Interval estimation of the regression parameters
Where
Example

Construct 95% confidence intervals for β1 and β0

SSE = Syy − β1 Sxy = 32.1 − 1.15 × 23 = 5.65

The estimate for σε² = SSE/(n − 2) = 5.65/8 = 0.71, so the estimated standard deviation is σε = √0.71 ≈ 0.84
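The numbers above can be carried through to an interval for the slope with a short Python sketch. Several things here are assumptions rather than statements from the slides: n = 10 is inferred from n − 2 = 8, Sxx is recovered from the identity β1 = Sxy/Sxx, the critical value t(0.025, 8) = 2.306 is taken from a standard t table, and the interval uses the usual textbook form β1 ± t × SE(β1) with SE(β1) = √(σε²/Sxx). An interval for β0 would also need the sample mean of X, which is not given.

```python
import math

s_yy = 32.1        # total sum of squares of Y, from the example
s_xy = 23.0        # sum of cross-products, from the example
beta1_hat = 1.15   # estimated slope, from the example
n = 10             # assumption: inferred from n - 2 = 8
t_crit = 2.306     # t table value for 0.025 upper tail, 8 degrees of freedom

# Residual sum of squares and the variance estimate
sse = s_yy - beta1_hat * s_xy          # 5.65
sigma2_hat = sse / (n - 2)             # about 0.706 (a variance)
sigma_hat = math.sqrt(sigma2_hat)      # about 0.84 (a standard deviation)

# Assumption: recover Sxx from beta1_hat = Sxy / Sxx
s_xx = s_xy / beta1_hat                # 20

# Standard error of the slope and the 95% confidence interval
se_beta1 = math.sqrt(sigma2_hat / s_xx)
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)

print(f"sigma^2 = {sigma2_hat:.3f}, sigma = {sigma_hat:.2f}")
print(f"95% CI for beta1: ({ci_beta1[0]:.2f}, {ci_beta1[1]:.2f})")
```

Under these assumptions the interval for the slope is roughly 0.72 to 1.58. Note that 0.84 is the square root of SSE/(n − 2), that is, the estimated standard deviation of the errors rather than their variance.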
Hypothesis testing
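As a sketch of the usual textbook test for the slope (assumed here, since the slide gives only the heading), the null hypothesis H0: β1 = 0 is tested against Ha: β1 ≠ 0 with a t statistic on n − 2 degrees of freedom. The symbols S_xx = Σ(x_i − x̄)² and σ̂ε² = SSE/(n − 2) follow the standard notation.

```latex
t = \frac{\hat{\beta}_1 - 0}{\operatorname{SE}(\hat{\beta}_1)},
\qquad
\operatorname{SE}(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}_\varepsilon^{2}}{S_{xx}}},
\qquad
\text{reject } H_0 \text{ if } |t| > t_{\alpha/2,\; n-2}
```

Rejecting H0 at level α is equivalent to the 100(1 − α)% confidence interval for β1 not containing zero, which is why the t-test and the confidence interval approach lead to the same conclusion.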
Correlation Analysis
• Correlation is the method of analysis to use when studying the possible association between two continuous variables.

• The standard method (Pearson correlation) leads to a quantity called r that can take on any value from -1 to +1.

• The correlation coefficient r measures the degree of 'straight-line' association between the values of two variables.
Correlation Analysis (cont’d)
• The correlation between two variables is
  – positive if higher values of one variable are associated with higher values of the other, and
  – negative if one variable tends to be lower as the other gets higher.
• A correlation of around zero indicates that there is no linear relation between the values of the two variables.
Fig.1: Systolic Blood Pressure against Age
If we have two variables X and Y, the correlation
between them, denoted by r(X, Y), is given by:

\[
r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \,\sum (y_i-\bar{y})^2}}
  = \frac{\sum XY - \left[\sum X \sum Y\right]/n}{\sqrt{\left[\sum X^2 - \left(\sum X\right)^2/n\right]\left[\sum Y^2 - \left(\sum Y\right)^2/n\right]}}
\]

where xi and yi are the values of X and Y for the ith individual.

The equation is clearly symmetric, as it does not matter which variable is X and which is Y.
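The computational form of r can be sketched as a small Python function. The name `pearson_r` and the five data points are hypothetical, chosen only to illustrate the formula and its symmetry.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient using the sums-based formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x, sum_y = sum(xs), sum(ys)
    # Corrected sums of cross-products and squares
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)
print(f"r(X, Y) = {r_xy:.4f}, r(Y, X) = {r_yx:.4f}")
```

Swapping the two arguments gives the same value, which is the symmetry noted above. This is unlike the regression slope, which changes when the roles of X and Y are exchanged.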
Pearson’s r Correlation
• As a rule of thumb, the following guidelines on
strength of relationship are often useful (though
many experts would somewhat disagree on the
choice of boundaries).
Correlation value      Interpretation
±0.70 or higher        Strong relationship
±0.40 to ±0.69         Moderate relationship
±0.20 to ±0.39         Weak relationship
±0.01 to ±0.19         No or negligible relationship
Coefficient of determination (R²)
• The coefficient of determination (R²) measures how well a statistical model predicts an outcome.
• The outcome is represented by the model’s dependent variable.
• The lowest possible value of R² is 0 and the highest possible value is 1.

• It gives the proportion of the variation in the dependent variable that is explained by the independent variable.
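One common way to compute R² is as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small hypothetical dataset (not from the slides) and also shows that, for simple linear regression with an intercept, R² equals the square of Pearson's r.

```python
import math

# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)      # total sum of squares (SST)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line
b = sxy / sxx
a = mean_y - b * mean_x

# Residual sum of squares (SSE) around the fitted line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 = {r_squared:.4f}, r^2 = {r * r:.4f}")
```

Here 60% of the variation in Y is accounted for by the fitted line, and the remaining 40% is left in the residuals.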
Example: The following data show the respective weights of a sample
of 12 fathers and their oldest sons. Compute the correlation coefficient
between the two weight measurements.

Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970
Scatter Plot
[Figure: scatter plot of fathers' weight (X, horizontal axis) against sons' weight (Y, vertical axis)]
Calculating r
The correlation coefficient for the data on fathers’ and
sons’ weights will be:
Basic values from the data
\[
\sum X = 800,\quad \sum X^2 = 53{,}418,\quad \sum Y = 811,\quad \sum Y^2 = 54{,}849,\quad \sum XY = 54{,}107
\]

\[
\sum (x-\bar{x})(y-\bar{y}) = \sum xy - \left(\sum x\right)\left(\sum y\right)/n = 54{,}107 - (800 \times 811)/12 = 40.33
\]

\[
\sum (x-\bar{x})^2 = \sum x^2 - \left(\sum x\right)^2/n = 53{,}418 - (800)^2/12 = 84.67
\]

\[
\sum (y-\bar{y})^2 = \sum y^2 - \left(\sum y\right)^2/n = 54{,}849 - (811)^2/12 = 38.92
\]
Calculating r
\[
r = \frac{40.33}{\sqrt{(84.67)(38.92)}} = 0.703
\]
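The calculation can be reproduced with a few lines of Python, using the twelve pairs of weights from the table. The variable names are illustrative.

```python
import math

# Weights of 12 fathers (X) and their oldest sons (Y), from the table
fathers = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
sons = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]

n = len(fathers)
sum_x, sum_y = sum(fathers), sum(sons)                 # 800, 811
sum_x2 = sum(x * x for x in fathers)                   # 53418
sum_y2 = sum(y * y for y in sons)                      # 54849
sum_xy = sum(x * y for x, y in zip(fathers, sons))     # 54107

# Corrected sums of cross-products and squares
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n

r = sxy / math.sqrt(sxx * syy)
print(f"Sxy = {sxy:.2f}, Sxx = {sxx:.2f}, Syy = {syy:.2f}, r = {r:.3f}")
```

By the rule-of-thumb guidelines given earlier, a value of about 0.70 would be read as a strong positive relationship between fathers' and sons' weights.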
Exercise
• Given the following table about the relationship between corporate bond expected return (Y) per year and its potential risk (X), both in percentage terms:

a. Write the fitted sample regression line
b. Interpret the results obtained from the linear relationship
c. Compute the values of r and r², then discuss what they imply
d. Compute the variance and standard error of β0 and β1
e. Construct a 95% confidence interval
f. Test the hypothesis H0: β = 0 against the alternative Ha: β ≠ 0 at the 5% significance level using both the t-test and the confidence interval approach
