Simple Linear Regression Analysis


Simple Linear Regression Analysis:

A linear regression model attempts to explain the relationship between two or more variables
using a straight line. Consider the data obtained from a chemical process where the yield of the
process is thought to be related to the reaction temperature (see the table below).

Regression Formula:
Regression Equation(y) = a + bx

Slope (b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)

Intercept (a) = (ΣY - b (ΣX)) / N

Where,
X and Y are the variables.

b = the slope of the regression line

a = the intercept point of the regression line and the y axis.

N = Number of values or elements

X = First Score

Y = Second Score

ΣXY = Sum of the product of first and Second Scores

ΣX = Sum of First Scores

ΣY = Sum of Second Scores

ΣX² = Sum of the squares of the First Scores

Regression Example:

Find the simple linear regression of Y on X for the following data:

X Values    Y Values
60          3.1
61          3.6
62          3.8
63          4
65          4.1

To find the regression equation, we first find the slope and the intercept and then use them to form the equation.
Step 1:
Count the number of values. N = 5
Step 2:
Find XY and X² for each pair of values. See the table below.

X Value Y Value X*Y X*X

60 3.1 60 * 3.1 = 186 60 * 60 = 3600

61 3.6 61 * 3.6 = 219.6 61 * 61 = 3721

62 3.8 62 * 3.8 = 235.6 62 * 62 = 3844

63 4 63 * 4 = 252 63 * 63 = 3969

65 4.1 65 * 4.1 = 266.5 65 * 65 = 4225

Step 3:
Find ΣX, ΣY, ΣXY, ΣX².
ΣX = 311, ΣY = 18.6, ΣXY = 1159.7, ΣX² = 19359

Step 4:
Substitute in the above slope formula given.
Slope (b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)
= ((5)*(1159.7) - (311)*(18.6)) / ((5)*(19359) - (311)²)
= (5798.5 - 5784.6) / (96795 - 96721)
= 13.9/74 = 0.19 (rounded to two decimal places)
Step 5:
Now, again substitute in the above intercept formula given.

Intercept (a) = (ΣY - b (ΣX)) / N

= (18.6 - 0.19(311))/5
= (18.6 - 59.09)/5

= -40.49/5

= -8.098
Step 6:
Then substitute these values into the regression equation formula:

Regression Equation(y) = a + bx

= -8.098 + 0.19x.

Suppose we want to know the approximate y value for x = 64. Then we can substitute the
value in the above equation.

Regression Equation(y) = a + bx

= -8.098 + 0.19(64).

= -8.098 + 12.16

= 4.06
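
The six steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name fit_line is hypothetical and only the standard library is used.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept from the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope (b) = (N*SXY - SX*SY) / (N*SX2 - SX^2)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept (a) = (SY - b*SX) / N
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 64  # estimated y at x = 64

print(f"slope={slope:.4f} intercept={intercept:.4f} y(64)={prediction:.2f}")
```

Because the code keeps the slope unrounded (about 0.1878 instead of 0.19), its intercept comes out near -7.96 rather than -8.098. The prediction at x = 64 still rounds to 4.06.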

Hypothesis Tests in Simple Linear Regression

The following sections discuss hypothesis tests on the regression coefficients in simple linear
regression. These tests can be carried out if it can be assumed that the random error term, ε, is
normally and independently distributed with a mean of zero and variance of σ².

t Tests

The t tests are used to conduct hypothesis tests on the regression coefficients obtained in simple
linear regression. A statistic based on the t distribution is used to test the two-sided hypothesis
that the true slope, β1, equals some constant value, β1,0. The statements for the hypothesis test
are expressed as:

H0: β1 = β1,0
H1: β1 ≠ β1,0

The test statistic used for this test is:

T0 = (β̂1 - β1,0) / se(β̂1)

where β̂1 is the least squares estimate of β1, and se(β̂1) is its standard error. The value of se(β̂1) can
be calculated as follows:

se(β̂1) = √[ (Σ ei² / (n - 2)) / Σ (xi - x̄)² ]

where ei = yi - ŷi is the ith residual. The test statistic, T0, follows a t distribution with (n - 2)
degrees of freedom, where n is the total number of observations.

Example

The test for the significance of regression for the data in the preceding table is illustrated in this
example. The test is carried out using the t test on the coefficient β1. The hypothesis to be
tested is H0: β1 = 0. To calculate the statistic to test H0, the estimate, β̂1, and the standard
error, se(β̂1), are needed. The value of β̂1 was obtained in this section. The standard error can
be calculated from the expression for se(β̂1) given above.

Then, because the hypothesized value β1,0 is 0, the test statistic is t0 = β̂1 / se(β̂1).

The p value corresponding to this statistic, based on the t distribution with 23 (n - 2 = 25 - 2 = 23) degrees
of freedom, is 2 × P(T > |t0|).
Assuming that the desired significance level is 0.1, since p value < 0.1, H0: β1 = 0 is rejected,
indicating that a relation exists between temperature and yield for the data in the preceding table.
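
The t test on the slope can be sketched as follows. The temperature and yield table is not reproduced in these notes, so the sketch reuses the five (X, Y) pairs from the regression example as stand-in data; that substitution is an assumption made only for illustration. The function name slope_t_statistic is hypothetical. The critical value 2.353 is the two-sided 10% point of the t distribution with 3 degrees of freedom, taken from a standard t table.

```python
import math


def slope_t_statistic(xs, ys, beta_null=0.0):
    """Return (t0, standard error of the slope, degrees of freedom)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # se(slope) = sqrt( (SSE / (n - 2)) / Sxx )
    se_slope = math.sqrt((sse / (n - 2)) / sxx)
    t0 = (slope - beta_null) / se_slope
    return t0, se_slope, n - 2


# Stand-in data (assumption): the five pairs from the regression example.
xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]

t0, se_slope, df = slope_t_statistic(xs, ys)

T_CRIT = 2.353  # t table value: two-sided alpha = 0.1, df = 3
reject_h0 = abs(t0) > T_CRIT

print(f"t0={t0:.3f} se={se_slope:.4f} df={df} reject H0: {reject_h0}")
```

Comparing |t0| with a tabulated critical value is equivalent to comparing the p value with the significance level, and it avoids needing a t distribution function from an external library.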

Correlation Analysis

In correlation analysis, we estimate a sample correlation coefficient, more specifically
the Pearson Product Moment correlation coefficient. The sample correlation coefficient,
denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear
association between the two variables. The correlation between two variables can be positive
(i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e.,
higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.

For example, a correlation of r = 0.9 suggests a strong, positive association between two
variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation
close to zero suggests no linear association between two continuous variables.

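The formula for the sample correlation coefficient is

r = Cov(x,y) / √(sx² · sy²)

where Cov(x,y) is the covariance of x and y, defined as

Cov(x,y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)

and sx² and sy² are the sample variances of x and y, defined as

sx² = Σ (xi - x̄)² / (n - 1)   and   sy² = Σ (yi - ȳ)² / (n - 1)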

Example - Correlation of Gestational Age and Birth Weight


A small study is conducted involving 17 infants to investigate the association between
gestational age at birth, measured in weeks, and birth weight, measured in grams.
The variance of gestational age is:

Next, we summarize the birth weight data. The mean birth weight is:

The variance of birth weight is computed just as we did for gestational age as shown in the table
below.
The variance of birth weight is:

Next, we compute the covariance.

To compute the covariance of gestational age and birth weight, we need to multiply the
deviation from the mean gestational age by the deviation from the mean birth weight for each
participant, i.e., (xi - x̄)(yi - ȳ).

The computations are summarized below. Notice that we simply copy the deviations from the
mean gestational age and birth weight from the two tables above into the table below and
multiply.
The covariance of gestational age and birth weight is:

We now compute the sample correlation coefficient:

Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
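
The same sequence of steps (means, variances, covariance, then r) can be sketched in Python. The infant data table is not reproduced in these notes, so the sketch uses the five (X, Y) pairs from the regression example instead; this is an assumption for illustration only, and the function name sample_correlation is hypothetical.

```python
import math


def sample_correlation(xs, ys):
    """Pearson r = Cov(x, y) / sqrt(var_x * var_y), using n - 1 divisors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    return cov_xy / math.sqrt(var_x * var_y)


# Stand-in data (assumption): the five pairs from the regression example.
xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]

r = sample_correlation(xs, ys)
print(f"r = {r:.3f}")
```

For these five points r is a little above 0.9, a strong positive linear association, which agrees with the positive slope found in the regression example.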

CONFIDENCE INTERVALS FOR THE MEAN, UNKNOWN VARIANCE:

          Sigma Known           Sigma Unknown

n≥30      x̄ ± zα/2 · σ/√n       x̄ ± zα/2 · s/√n

n<30      x̄ ± zα/2 · σ/√n       x̄ ± tα/2 · s/√n
Example:

n = 8, x̄ = 0.2, s = 0.07, α = 0.05, df = n − 1 = 7. From Table 6, tα/2 = t0.025 = 2.365. Therefore, the confidence
interval is

Solution:

x̄ ± tα/2 · s/√n

= 0.2 ± 2.365 × (0.07/√8)

= 0.2 ± 0.059

= (0.141, 0.259)
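
The arithmetic in this example can be checked with a short Python sketch. The function name t_interval is hypothetical, and the critical value 2.365 is passed in from the t table rather than computed.

```python
import math


def t_interval(mean, s, n, t_crit):
    """Confidence interval mean +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin


low, high = t_interval(mean=0.2, s=0.07, n=8, t_crit=2.365)
print(f"({low:.3f}, {high:.3f})")
```

The unrounded margin is about 0.0585, so the limits differ from (0.141, 0.259) only in the fourth decimal place.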

Confidence Interval for Two Independent Samples

If n1 > 30 and n2 > 30: use the Z table for the standard normal distribution.

If n1 < 30 or n2 < 30: use the t-table with df = n1 + n2 - 2.

Large Sample Example

The table below summarizes data on n=3,539 participants attending the 7th examination of the
Offspring cohort in the Framingham Heart Study.

                                    Men                       Women
Characteristic              n       x̄       s         n       x̄       s
Systolic Blood Pressure     1,623   128.2   17.5      1,911   126.5   20.1
Diastolic Blood Pressure    1,622   75.6    9.8       1,910   72.6    9.7
Total Serum Cholesterol     1,544   192.4   35.2      1,766   207.1   36.7
Weight                      1,612   194.0   33.8      1,894   157.7   34.6
Height                      1,545   68.9    2.7       1,781   63.4    2.5
Body Mass Index             1,545   28.8    4.6       1,781   27.6    5.9

Small Sample Example


We previously considered a subsample of n=10 participants attending the 7th examination of
the Offspring cohort in the Framingham Heart Study. The following table contains descriptive
statistics on the same continuous characteristics in the subsample stratified by sex.

                                    Men                        Women
Characteristic              n   Sample Mean   s        n   Sample Mean   s
Systolic Blood Pressure     6   117.5         9.7      4   126.8         12.0
Diastolic Blood Pressure    6   72.5          7.1      4   69.5          8.1
Total Serum Cholesterol     6   193.8         30.2     4   215.0         48.8
Weight                      6   196.9         26.9     4   146.0         7.2
Height                      6   70.2          1.0      4   62.6          2.3
Body Mass Index             6   28.0          3.6      4   26.2          2.0
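
As an illustration, the sketch below computes a 95% confidence interval for the difference in mean systolic blood pressure (men minus women) from both tables. It assumes the pooled-variance interval, which is the method that df = n1 + n2 - 2 implies and which requires roughly equal population variances. The function name pooled_interval is hypothetical. The critical values 1.96 (Z table) and 2.306 (t table, df = 8) are standard table values.

```python
import math


def pooled_interval(mean1, s1, n1, mean2, s2, n2, crit):
    """CI for mean1 - mean2 using the pooled standard deviation."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two variances.
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    margin = crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin


# Large samples: systolic blood pressure, Z critical value for 95%.
large = pooled_interval(128.2, 17.5, 1623, 126.5, 20.1, 1911, crit=1.96)

# Small samples: systolic blood pressure, t critical value for df = 6 + 4 - 2 = 8.
small = pooled_interval(117.5, 9.7, 6, 126.8, 12.0, 4, crit=2.306)

print(f"large sample: ({large[0]:.2f}, {large[1]:.2f})")
print(f"small sample: ({small[0]:.2f}, {small[1]:.2f})")
```

The large-sample interval lies entirely above zero, while the small-sample interval is much wider and includes zero, which shows how strongly the sample sizes drive the precision of the estimate.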


Confidence Interval for One Sample

Example: During the 7th examination of the Offspring cohort in the Framingham Heart Study
there were 1,219 participants being treated for hypertension and 2,313 who were not on
treatment. If we call treatment a "success", then x=1,219 and n=3,532. The sample proportion
is

p̂ = x/n = 1,219/3,532 = 0.345

This is the point estimate, i.e., our best estimate of the proportion of the population on treatment
for hypertension is 34.5%. The sample is large, so the confidence interval can be computed
using the formula:

p̂ ± z · √( p̂ (1 - p̂) / n )
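
A minimal sketch of the computation, assuming the usual large-sample interval p̂ ± z · √(p̂(1 - p̂)/n) with z = 1.96 for 95% confidence. The function name proportion_interval is hypothetical.

```python
import math


def proportion_interval(x, n, z=1.96):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


p_hat, low, high = proportion_interval(x=1219, n=3532)
print(f"p_hat={p_hat:.3f} CI=({low:.3f}, {high:.3f})")
```

With these counts the margin of error is roughly 0.016, so the interval runs from about 33% to about 36%.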


Confidence Interval for μ1 - μ2

Confidence interval:

(x̄1 - x̄2) ± t*df · √( s1²/n1 + s2²/n2 )

where t*df is the value from the t-table
that corresponds to the confidence level, and

df = ( s1²/n1 + s2²/n2 )² / [ (1/(n1 - 1)) (s1²/n1)² + (1/(n2 - 1)) (s2²/n2)² ]

Question:

home: x̄1 = 68.25, s1 = 21.8, n1 = 8

road: x̄2 = 68.63, s2 = 8.9, n2 = 8

Calculate a 95% CI for μ1 - μ2, where

μ1 = mean points per game allowed by Duke at home.
μ2 = mean points per game allowed by Duke on the road.

• n1 = 8, n2 = 8; s1² = 475.36; s2² = 79.41

(These squares presumably come from unrounded standard deviations. Squaring the rounded values gives
21.8² = 475.24 and 8.9² = 79.21, which lead to the same df to two decimal places.)

df = ( s1²/n1 + s2²/n2 )² / [ (1/(n1 - 1)) (s1²/n1)² + (1/(n2 - 1)) (s2²/n2)² ]

   = ( 475.36/8 + 79.41/8 )² / [ (1/7) (475.36/8)² + (1/7) (79.41/8)² ]

   = 9.27
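
The degrees-of-freedom formula and the resulting interval can be sketched in Python as follows. The function names welch_df and welch_interval are hypothetical. The notes do not state which t value to use for df = 9.27; the sketch assumes the conservative choice of rounding down to 9 degrees of freedom, for which the t table gives 2.262 at 95% confidence.

```python
import math


def welch_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for unequal variances."""
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def welch_interval(mean1, s1, n1, mean2, s2, n2, t_crit):
    """CI for mean1 - mean2 without pooling the variances."""
    margin = t_crit * math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - margin, diff + margin


df = welch_df(21.8, 8, 8.9, 8)

# Assumption: df rounded down to 9, so t_crit = 2.262 from the t table.
low, high = welch_interval(68.25, 21.8, 8, 68.63, 8.9, 8, t_crit=2.262)

print(f"df={df:.2f} CI=({low:.2f}, {high:.2f})")
```

The interval is wide and centered near zero, so under these assumptions the data give no evidence of a difference between home and road games.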

Chi-Square Test for Independence:

This lesson explains how to conduct a chi-square test for independence. The test is applied
when you have two categorical variables from a single population. It is used to determine
whether there is a significant association between the two variables.

When to Use Chi-Square Test for Independence

The test procedure described in this lesson is appropriate when the following conditions are met:

• The sampling method is simple random sampling.
• The variables under study are each categorical.
• If sample data are displayed in a contingency table, the expected frequency count for each cell of
the table is at least 5.

State the Hypotheses

Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that
knowing the level of Variable A does not help you predict the level of Variable B. That is, the
variables are independent.

H0: Variable A and Variable B are independent.


Ha: Variable A and Variable B are not independent.

The alternative hypothesis is that knowing the level of Variable A can help you predict the level
of Variable B.

Formula:

Χ² = Σ [ (Or,c - Er,c)² / Er,c ]


Problem

A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were
classified by gender (male or female) and by voting preference (Republican, Democrat, or
Independent). Results are shown in the contingency table below.

                       Voting Preferences
               Republican   Democrat   Independent   Row total
Male           200          150        50            400
Female         250          300        50            600
Column total   450          450        100           1000

Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.

Solution

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:

• State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.

H0: Gender and voting preferences are independent.

Ha: Gender and voting preferences are not independent.

• Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.
• Analyze sample data. Applying the chi-square test for independence to sample data,
we compute the degrees of freedom, the expected frequency counts, and the chi-square
test statistic. Based on the chi-square statistic and the degrees of freedom, we
determine the P-value.

DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
Er,c = (nr * nc) / n
E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60

Χ² = Σ [ (Or,c - Er,c)² / Er,c ]
Χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40
   + (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60
Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2

where DF is the degrees of freedom, r is the number of levels of gender, c is the number
of levels of the voting preference, nr is the number of observations from level r of gender,
nc is the number of observations from level c of voting preference, n is the number of
observations in the sample, Er,c is the expected frequency count when gender is
level r and voting preference is level c, and Or,c is the observed frequency count when
gender is level r and voting preference is level c.

The P-value is the probability that a chi-square statistic having 2 degrees of freedom is
more extreme than 16.2.

We use the Chi-Square Distribution Calculator to find P(Χ² > 16.2) = 0.0003.

• Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we
reject the null hypothesis. Thus, we conclude that there is a relationship between
gender and voting preference.
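
The analysis step can be reproduced with a short Python sketch. The function name chi_square_independence is hypothetical. In place of the Chi-Square Distribution Calculator, the sketch uses the fact that for exactly 2 degrees of freedom the upper-tail probability is exp(-Χ²/2); that shortcut does not apply to other degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Expected count under independence: (row total * column total) / n
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: male, female. Columns: Republican, Democrat, Independent.
observed = [[200, 150, 50],
            [250, 300, 50]]

chi2, df = chi_square_independence(observed)

# Valid only because df == 2: P(X^2 > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.2f} df={df} p={p_value:.4f}")
```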

Chi-Square Goodness of Fit Test:


This lesson explains how to conduct a chi-square goodness of fit test. The test is applied when
you have one categorical variable from a single population. It is used to determine whether
sample data are consistent with a hypothesized distribution.

When to Use the Chi-Square Goodness of Fit Test

The chi-square goodness of fit test is appropriate when the following conditions are met:

• The sampling method is simple random sampling.
• The variable under study is categorical.
• The expected value of the number of sample observations in each level of the variable is
at least 5.

State the Hypotheses

Every hypothesis test requires the analyst to state a null hypothesis (H0) and an alternative
hypothesis (Ha). The hypotheses are stated in such a way that they are mutually exclusive. That
is, if one is true, the other must be false; and vice versa.

For a chi-square goodness of fit test, the hypotheses take the following form.

H0: The data are consistent with a specified distribution.


Ha: The data are not consistent with a specified distribution.

Analyze Sample Data

Using sample data, find the degrees of freedom, expected frequency counts, test statistic, and the
P-value associated with the test statistic.

• Degrees of freedom. The degrees of freedom (DF) is equal to the number of levels (k) of the
categorical variable minus 1: DF = k - 1.

• Expected frequency counts. The expected frequency counts at each level of the categorical
variable are equal to the sample size times the hypothesized proportion from the null hypothesis:

Ei = npi

where Ei is the expected frequency count for the ith level of the categorical variable, n is the total
sample size, and pi is the hypothesized proportion of observations in level i.

• Test statistic. The test statistic is a chi-square random variable (Χ²) defined by the following
equation.

Χ² = Σ [ (Oi - Ei)² / Ei ]

where Oi is the observed frequency count for the ith level of the categorical variable, and Ei is the
expected frequency count for the ith level of the categorical variable.

• P-value. The P-value is the probability of observing a sample statistic at least as extreme as the test
statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to
assess the probability associated with the test statistic. Use the degrees of freedom computed
above.

Problem

• Acme Toy Company prints baseball cards. The company claims that 30% of the cards are rookies,
60% are veterans, and 10% are All-Stars.
• Suppose a random sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this
consistent with Acme's claim? Use a 0.05 level of significance.

Solution

The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3)
analyze sample data, and (4) interpret results. We work through those steps below:

• State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
   • Null hypothesis: The proportions of rookies, veterans, and All-Stars are 30%, 60% and 10%,
   respectively.
   • Alternative hypothesis: At least one of the proportions in the null hypothesis is false.

• Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we
will conduct a chi-square goodness of fit test of the null hypothesis.
• Analyze sample data. Applying the chi-square goodness of fit test to sample data, we compute
the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based
on the chi-square statistic and the degrees of freedom, we determine the P-value.

DF = k - 1 = 3 - 1 = 2

(Ei) = n * pi
(E1) = 100 * 0.30 = 30
(E2) = 100 * 0.60 = 60
(E3) = 100 * 0.10 = 10

Χ² = Σ [ (Oi - Ei)² / Ei ]
Χ² = [ (50 - 30)² / 30 ] + [ (45 - 60)² / 60 ] + [ (5 - 10)² / 10 ]
Χ² = (400 / 30) + (225 / 60) + (25 / 10) = 13.33 + 3.75 + 2.50 = 19.58

where DF is the degrees of freedom, k is the number of levels of the categorical variable, n is the
number of observations in the sample, Ei is the expected frequency count for level i, Oi is the
observed frequency count for level i, and Χ² is the chi-square test statistic.

The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more
extreme than 19.58.

We use the Chi-Square Distribution Calculator to find P(Χ² > 19.58) = 0.0001.

• Interpret results. Since the P-value (0.0001) is less than the significance level (0.05), we
reject the null hypothesis and conclude that the sample is not consistent with Acme's claim.
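
The goodness of fit computation can be sketched in the same way. The function name chi_square_gof is hypothetical. As in the problem, DF = 2, so the upper-tail probability is taken as exp(-Χ²/2), a closed form that holds only for 2 degrees of freedom.

```python
import math


def chi_square_gof(observed, proportions):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    n = sum(observed)
    expected = [n * p for p in proportions]  # Ei = n * pi
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1, expected


# Observed rookies, veterans, All-Stars versus the claimed 30% / 60% / 10%.
chi2, df, expected = chi_square_gof([50, 45, 5], [0.30, 0.60, 0.10])

# Valid only because df == 2: P(X^2 > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.2f} df={df} p={p_value:.5f}")
```

The unrounded P-value is a little below 0.0001, which matches the calculator result when rounded to four decimal places.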
