Simple Linear Regression Analysis
A linear regression model attempts to explain the relationship between two or more variables
using a straight line. Consider the data obtained from a chemical process where the yield of the
process is thought to be related to the reaction temperature (see the table below).
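Regression Formula:
Regression Equation (y) = a + bx
Slope (b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)
Intercept (a) = (ΣY - b(ΣX)) / N
Where,
x and y are the variables,
b = the slope of the regression line,
a = the intercept (the value of y where x = 0),
N = the number of values,
X = First Score,
Y = Second Score.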
Regression Example:
X Values Y Values
60 3.1
61 3.6
62 3.8
63 4
65 4.1
To find the regression equation, we first find the slope and the intercept, then use them to form the equation.
Step 1:
Count the number of values. N = 5
Step 2:
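Find XY and X² for each pair of values. See the table below.
X Value   Y Value   X*Y                 X*X
60        3.1       60 * 3.1 = 186      60 * 60 = 3600
61        3.6       61 * 3.6 = 219.6    61 * 61 = 3721
62        3.8       62 * 3.8 = 235.6    62 * 62 = 3844
63        4         63 * 4 = 252        63 * 63 = 3969
65        4.1       65 * 4.1 = 266.5    65 * 65 = 4225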
Step 3:
Find ΣX, ΣY, ΣXY, ΣX².
ΣX = 311, ΣY = 18.6, ΣXY = 1159.7, ΣX² = 19359
Step 4:
Substitute these sums into the slope formula.
Slope (b) = (NΣXY - (ΣX)(ΣY)) / (NΣX² - (ΣX)²)
= ((5)*(1159.7) - (311)*(18.6)) / ((5)*(19359) - (311)²)
= (5798.5 - 5784.6) / (96795 - 96721)
= 13.9 / 74 = 0.19 (rounded to two decimal places)
Step 5:
Now substitute into the intercept formula.
Intercept (a) = (ΣY - b(ΣX)) / N
= (18.6 - 0.19(311)) / 5
= (18.6 - 59.09) / 5
= -40.49 / 5
= -8.098
Step 6:
Then substitute these values into the regression equation formula.
Regression Equation (y) = a + bx
= -8.098 + 0.19x
Suppose we want to know the approximate y value for x = 64. We can substitute that value into the
equation above.
Regression Equation (y) = a + bx
= -8.098 + 0.19(64)
= -8.098 + 12.16
= 4.06
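The six steps above can be sketched in a few lines of Python. This is only an illustration of the summation formulas applied to the five (X, Y) pairs from the example table; the function name `fit_line` is a hypothetical helper, not something defined in these notes.

```python
# Least-squares line for the five (X, Y) pairs in the example table.
xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]


def fit_line(xs, ys):
    """Return (intercept a, slope b) using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope (b) = (N*SumXY - SumX*SumY) / (N*SumX2 - SumX^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept (a) = (SumY - b*SumX) / N
    a = (sum_y - b * sum_x) / n
    return a, b


a, b = fit_line(xs, ys)
prediction = a + b * 64  # estimated y at x = 64
print(f"slope b = {b:.4f}, intercept a = {a:.4f}")
print(f"predicted y at x = 64: {prediction:.2f}")
```

Because the code keeps the unrounded slope (13.9/74, about 0.1878), its intercept (about -7.96) differs slightly from the hand value of -8.098, which was computed with b rounded to 0.19. The prediction at x = 64 still rounds to 4.06.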
The following sections discuss hypothesis tests on the regression coefficients in simple linear
regression. These tests can be carried out if it can be assumed that the random error term, ε, is
normally and independently distributed with a mean of zero and variance of σ².
t Tests
The t tests are used to conduct hypothesis tests on the regression coefficients obtained in simple
linear regression. A statistic based on the t distribution is used to test the two-sided hypothesis
that the true slope, β1, equals some constant value, β1,0. The statements for the hypothesis test
are expressed as:
H0: β1 = β1,0
H1: β1 ≠ β1,0
The test statistic used for this test is:
T0 = (β̂1 - β1,0) / se(β̂1)
where β̂1 is the least square estimate of β1, and se(β̂1) is its standard error. The value of se(β̂1) can
be calculated as follows:
se(β̂1) = sqrt( [ Σ(yi - ŷi)² / (n - 2) ] / Σ(xi - x̄)² )
The test statistic, T0, follows a t distribution with (n - 2) degrees of freedom, where n is the total
number of observations.
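As a rough sketch of how the test statistic is assembled, the Python below computes the slope, its standard error, and T0 for the null value β1,0 = 0. It assumes the five (X, Y) pairs from the regression example earlier in these notes, not the 25-observation temperature and yield data, and the helper name `slope_t_statistic` is hypothetical. The p value is left out because the standard library has no t distribution; a t-table or a statistics package would be needed for that last step.

```python
import math

# Five (X, Y) pairs from the earlier regression example (an assumption:
# the temperature/yield data are not reproduced in these notes).
xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]


def slope_t_statistic(xs, ys, beta_null=0.0):
    """Return (t0, standard error of the slope, degrees of freedom)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # se(slope) = sqrt( [SSE / (n - 2)] / Sxx )
    se_slope = math.sqrt((sse / (n - 2)) / sxx)
    t0 = (slope - beta_null) / se_slope
    return t0, se_slope, n - 2


t0, se_slope, df = slope_t_statistic(xs, ys)
print(f"t0 = {t0:.3f}, se = {se_slope:.4f}, df = {df}")
```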
Example
The test for the significance of regression for the data in the preceding table is illustrated in this
example. The test is carried out using the t test on the coefficient β1. The hypothesis to be
tested is H0: β1 = 0. To calculate the statistic to test H0, the estimate, β̂1, and the standard
error, se(β̂1), are needed. The value of β̂1 was obtained in this section. The standard error can
be calculated as follows:
se(β̂1) = sqrt( [ Σ(yi - ŷi)² / (n - 2) ] / Σ(xi - x̄)² )
Then, the test statistic can be calculated using the following equation:
t0 = (β̂1 - 0) / se(β̂1)
The p value corresponding to this statistic based on the t distribution with 23 (n - 2 = 25 - 2 = 23) degrees
of freedom can be obtained as follows:
p value = 2 × P(T > |t0|)
Assuming that the desired significance level is 0.1, since p value < 0.1, H0: β1 = 0 is rejected,
indicating that a relation exists between temperature and yield for the data in the preceding table.
Correlation Analysis
The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear
association between the two variables. The correlation between two variables can be positive
(i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e.,
higher levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two
variables, whereas a correlation of r = -0.2 suggests a weak, negative association. A correlation
close to zero suggests no linear association between two continuous variables.
Next, we summarize the birth weight data. The mean birth weight is:
The variance of birth weight is computed just as we did for gestational age as shown in the table
below.
The variance of birth weight is:
To compute the covariance of gestational age and birth weight, we need to multiply the
deviation from the mean gestational age by the deviation from the mean birth weight for each
participant, i.e., (x - x̄)(y - ȳ).
The computations are summarized below. Notice that we simply copy the deviations from the
mean gestational age and birth weight from the two tables above into the table below and
multiply.
The covariance of gestational age and birth weight is:
Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
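The covariance and correlation calculations described above can be sketched as follows. Since the gestational age and birth weight values are not listed in these notes, the example reuses the five (X, Y) pairs from the regression example purely as stand-in data. The function names are hypothetical, and the n - 1 divisor (the usual sample covariance) is an assumption.

```python
import math

# Stand-in data: the five (X, Y) pairs from the regression example.
xs = [60, 61, 62, 63, 65]
ys = [3.1, 3.6, 3.8, 4.0, 4.1]


def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Sample correlation r = cov(x, y) / sqrt(var(x) * var(y))."""
    var_x = covariance(xs, xs)  # covariance of a variable with itself is its variance
    var_y = covariance(ys, ys)
    return covariance(xs, ys) / math.sqrt(var_x * var_y)


cov_xy = covariance(xs, ys)
r = correlation(xs, ys)
print(f"covariance = {cov_xy:.3f}, r = {r:.3f}")
```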
n ≥ 30: x̄ - zα/2 * s/√n < μ < x̄ + zα/2 * s/√n
n < 30: x̄ - tα/2 * s/√n < μ < x̄ + tα/2 * s/√n, with df = n - 1
Example:
n = 8, x̄ = 0.2, s = 0.07, α = 0.05, df = n − 1 = 7. From Table 6, tα/2 = t0.025 = 2.365. The confidence
interval is computed as follows.
Solution:
x̄ ± tα/2 * s/√n
= 0.2 ± 2.365 * (0.07/√8)
= 0.2 ± 0.059
= (0.141, 0.259)
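The small-sample interval above can be checked with a short Python sketch. The critical value 2.365 is taken from the example itself (the t-table lookup is not reproduced in code), and `t_interval` is a hypothetical helper name.

```python
import math


def t_interval(mean, s, n, t_crit):
    """Return (lower, upper) for mean ± t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin


# Values from the example: n = 8, mean = 0.2, s = 0.07, t(0.025, df = 7) = 2.365
low, high = t_interval(mean=0.2, s=0.07, n=8, t_crit=2.365)
print(f"({low:.3f}, {high:.3f})")
```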
The table below summarizes data on n=3,539 participants attending the 7th examination of the
Offspring cohort in the Framingham Heart Study.
                           Men                      Women
Characteristic             n       x̄       s       n       x̄       s
Systolic Blood Pressure    1,623   128.2   17.5    1,911   126.5   20.1
Example: During the 7th examination of the Offspring cohort in the Framingham Heart Study
there were 1,219 participants being treated for hypertension and 2,313 who were not on
treatment. If we call treatment a "success", then x=1,219 and n=3,532. The sample proportion
is
p̂ = x/n = 1,219/3,532 = 0.345
This is the point estimate, i.e., our best estimate of the proportion of the population on treatment
for hypertension is 34.5%. The sample is large, so the confidence interval can be computed
using the formula:
p̂ ± z * sqrt( p̂(1 - p̂) / n )
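A minimal sketch of the large-sample interval for a proportion, p̂ ± z * sqrt(p̂(1 - p̂)/n), applied to the hypertension example. The 95% confidence level is an assumption made here for illustration, since the notes do not state one, and `proportion_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def proportion_interval(x, n, confidence=0.95):
    """Return (p_hat, lower, upper) for the large-sample interval."""
    p_hat = x / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - margin, p_hat + margin


# x = 1,219 participants on treatment out of n = 3,532
p_hat, low, high = proportion_interval(1219, 3532)
print(f"p_hat = {p_hat:.3f}, interval = ({low:.3f}, {high:.3f})")
```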
Confidence interval
(x̄1 - x̄2) ± t*df × sqrt( s1²/n1 + s2²/n2 )
where t*df is the value from the t-table
that corresponds to the confidence level, and
df = ( s1²/n1 + s2²/n2 )² / [ (1/(n1 - 1)) × (s1²/n1)² + (1/(n2 - 1)) × (s2²/n2)² ]
Question:
With s1² = 475.36, s2² = 79.41 and n1 = n2 = 8:
df = ( 475.36/8 + 79.41/8 )² / [ (1/7) × (475.36/8)² + (1/7) × (79.41/8)² ] = 9.27
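The degrees-of-freedom calculation for two samples with unequal variances can be sketched as below. The inputs are the sample variances 475.36 and 79.41 with eight observations in each group, as in the calculation above; the function name `welch_df` is a hypothetical label for this formula.

```python
def welch_df(var1, n1, var2, n2):
    """Approximate degrees of freedom for two samples with unequal variances."""
    a = var1 / n1  # s1^2 / n1
    b = var2 / n2  # s2^2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


df = welch_df(475.36, 8, 79.41, 8)
print(f"df = {df:.3f}")
```

The unrounded result should land very close to the 9.27 quoted above. In practice the value is usually rounded down to a whole number before looking up the t-table.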
This lesson explains how to conduct a chi-square test for independence. The test is applied
when you have two categorical variables from a single population. It is used to determine
whether there is a significant association between the two variables.
The test procedure described in this lesson is appropriate when the following conditions are met:
Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that
knowing the level of Variable A does not help you predict the level of Variable B. That is, the
variables are independent.
The alternative hypothesis is that knowing the level of Variable A can help you predict the level
of Variable B.
Formula:
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were
classified by gender (male or female) and by voting preference (Republican, Democrat, or
Independent). Results are shown in the contingency table below.
               Voting Preferences
               Republican   Democrat   Independent   Row total
Male           200          150        50            400
Female         250          300        50            600
Column total   450          450        100           1000
Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results. We work through those steps below:
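State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Null hypothesis: Gender and voting preference are independent.
Alternative hypothesis: Gender and voting preference are not independent.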
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample data,
we compute the degrees of freedom, the expected frequency counts, and the chi-
square test statistic. Based on the chi-square statistic and the degrees of freedom, we
determine the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
Er,c = (nr * nc) / n
E1,1 = (400 * 450) / 1000 = 180000/1000 = 180
E1,2 = (400 * 450) / 1000 = 180000/1000 = 180
E1,3 = (400 * 100) / 1000 = 40000/1000 = 40
E2,1 = (600 * 450) / 1000 = 270000/1000 = 270
E2,2 = (600 * 450) / 1000 = 270000/1000 = 270
E2,3 = (600 * 100) / 1000 = 60000/1000 = 60
Χ² = Σ [ (Or,c - Er,c)² / Er,c ]
Χ² = (200 - 180)²/180 + (150 - 180)²/180 + (50 - 40)²/40
+ (250 - 270)²/270 + (300 - 270)²/270 + (50 - 60)²/60
Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 + 100/60
Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
where DF is the degrees of freedom, r is the number of levels of gender, c is the number
of levels of the voting preference, nr is the number of observations from level r of gender,
nc is the number of observations from level c of voting preference, n is the number of
observations in the sample, Er,c is the expected frequency count when gender is
level r and voting preference is level c, and Or,c is the observed frequency count when
gender is level r and voting preference is level c.
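The P-value is the probability that a chi-square statistic having 2 degrees of freedom is
more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(Χ² > 16.2) = 0.0003.
Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we
reject the null hypothesis and conclude that there is a relationship between gender and voting
preference.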
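The expected counts and the test statistic for the voting example can be reproduced with the sketch below. The observed counts are the ones that appear in the Χ² calculation above. The function name is hypothetical, and the p-value line uses the closed form exp(-Χ²/2), which holds only for 2 degrees of freedom; other degrees of freedom need a chi-square table or a statistics package.

```python
import math

# Observed counts: rows are gender, columns are Republican, Democrat, Independent.
observed = [
    [200, 150, 50],
    [250, 300, 50],
]


def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count E(r,c)
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


chi2, df = chi_square_independence(observed)
# For df == 2 only, P(X2 > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi2 = {chi2:.1f}, df = {df}, p = {p_value:.4f}")
```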
The chi-square goodness of fit test is appropriate when the following conditions are met:
Every hypothesis test requires the analyst to state a null hypothesis (H0) and an alternative
hypothesis (Ha). The hypotheses are stated in such a way that they are mutually exclusive. That
is, if one is true, the other must be false; and vice versa.
For a chi-square goodness of fit test, the hypotheses take the following form.
Using sample data, find the degrees of freedom, expected frequency counts, test statistic, and the
P-value associated with the test statistic.
Degrees of freedom. The degrees of freedom (DF) is equal to the number of levels (k) of the
categorical variable minus 1: DF = k - 1 .
Expected frequency counts. The expected frequency counts at each level of the categorical
variable are equal to the sample size times the hypothesized proportion from the null hypothesis:
Ei = npi
where Ei is the expected frequency count for the ith level of the categorical variable, n is the total
sample size, and pi is the hypothesized proportion of observations in level i.
Test statistic. The test statistic is a chi-square random variable (Χ²) defined by the following
equation.
Χ² = Σ [ (Oi - Ei)² / Ei ]
where Oi is the observed frequency count for the ith level of the categorical variable, and Ei is the
expected frequency count for the ith level of the categorical variable.
P-value. The P-value is the probability of observing a sample statistic at least as extreme as the
test statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution Calculator to
assess the probability associated with the test statistic. Use the degrees of freedom computed
above.
Problem
Acme Toy Company prints baseball cards. The company claims that 30% of the cards are rookies,
60% veterans, and 10% are All-Stars.
Suppose a random sample of 100 cards has 50 rookies, 45 veterans, and 5 All-Stars. Is this
consistent with Acme's claim? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3)
analyze sample data, and (4) interpret results. We work through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: The proportion of rookies, veterans, and All-Stars is 30%, 60% and 10%,
respectively.
Alternative hypothesis: At least one of the proportions in the null hypothesis is false.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data, we
will conduct a chi-square goodness of fit test of the null hypothesis.
Analyze sample data. Applying the chi-square goodness of fit test to sample data, we compute
the degrees of freedom, the expected frequency counts, and the chi-square test statistic. Based
on the chi-square statistic and the degrees of freedom, we determine the P-value.
DF = k - 1 = 3 - 1 = 2
(Ei) = n * pi
(E1) = 100 * 0.30 = 30
(E2) = 100 * 0.60 = 60
(E3) = 100 * 0.10 = 10
Χ² = Σ [ (Oi - Ei)² / Ei ]
Χ² = [ (50 - 30)² / 30 ] + [ (45 - 60)² / 60 ] + [ (5 - 10)² / 10 ]
Χ² = (400 / 30) + (225 / 60) + (25 / 10) = 13.33 + 3.75 + 2.50 = 19.58
where DF is the degrees of freedom, k is the number of levels of the categorical variable, n is the
number of observations in the sample, Ei is the expected frequency count for level i, Oi is the
observed frequency count for level i, and Χ² is the chi-square test statistic.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more
extreme than 19.58.
We use the Chi-Square Distribution Calculator to find P(Χ² > 19.58) = 0.0001.
Interpret results. Since the P-value (0.0001) is less than the significance level (0.05), we reject
the null hypothesis. The sample is not consistent with Acme's claim.
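The goodness of fit calculation for the baseball card problem can be sketched in Python as follows. The function name is hypothetical, and, as a simplification, the p-value uses the closed form exp(-Χ²/2) that applies only when there are 2 degrees of freedom.

```python
import math

observed = [50, 45, 5]              # rookies, veterans, All-Stars in the sample
claimed = [0.30, 0.60, 0.10]        # proportions claimed by Acme


def chi_square_goodness_of_fit(observed, proportions):
    """Return (chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    expected = [n * p for p in proportions]  # Ei = n * pi
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


chi2, df = chi_square_goodness_of_fit(observed, claimed)
# For df == 2 only, P(X2 > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)
alpha = 0.05
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```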