Regression I
Plot of Advertising Data Set
Testing for Significant Correlation
• We need to be able to determine whether the relationship implied
by the sample correlation coefficient is real or due to chance.
𝐻0 : 𝜌𝑋𝑌 = 0 𝑣𝑠 𝐻1 : 𝜌𝑋𝑌 ≠ 0.
Test Statistic
• The test statistic is
$$t = \frac{r_{XY}}{s_r},$$
where $t$ follows a $t$ distribution with $df = n - 2$ under $H_0$ and
$$s_r = \sqrt{\frac{1 - r_{XY}^2}{n - 2}}$$
is the standard error of $r_{XY}$.
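The test statistic above can be computed directly from a sample correlation and a sample size. The following is a minimal sketch; the function name `correlation_t_statistic` and the inputs `r_xy=0.5`, `n=27` are hypothetical and are not taken from the Advertising data.

```python
import math


def correlation_t_statistic(r_xy: float, n: int) -> tuple[float, float]:
    """Return (s_r, t) for testing H0: rho_XY = 0 against H1: rho_XY != 0."""
    if n <= 2:
        raise ValueError("need n > 2 so that df = n - 2 is positive")
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    # Standard error of the sample correlation coefficient.
    s_r = math.sqrt((1.0 - r_xy ** 2) / (n - 2))
    # Under H0 this ratio follows a t distribution with n - 2 degrees of freedom.
    t = r_xy / s_r
    return s_r, t


# Hypothetical inputs chosen only for illustration.
s_r, t = correlation_t_statistic(r_xy=0.5, n=27)
print(f"s_r = {s_r:.4f}, t = {t:.4f}, df = {27 - 2}")
```

The resulting `t` would then be compared with a critical value, or converted to a 𝑝-value, from the 𝑡 distribution with 𝑛 − 2 degrees of freedom.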
Correlation Analysis for Advertising Data
| Variables | Correlation | Obs. Test Statistic | 𝑝-value | Decision |
|---|---|---|---|---|
| TV and Sales | 0.78 | 17.67 | 0.00 | Reject 𝐻0 |
| Radio and Sales | 0.35 | 9.92 | 0.00 | Reject 𝐻0 |
| Newspaper and Sales | 0.23 | 3.30 | 0.00 | Reject 𝐻0 |
• Source: http://www-bcf.usc.edu/~gareth/ISL/
How to Get the Estimates of the Coefficients?
• Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
• Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$-th value of $X$.
• Residual: $e_i = y_i - \hat{y}_i$.
• We define the residual sum of squares ($RSS$), also called the sum of squares due to errors (SSE), as
$$RSS = e_1^2 + \cdots + e_n^2.$$
• We minimize 𝑅𝑆𝑆 to get the estimates of the coefficients.
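For simple linear regression, minimizing 𝑅𝑆𝑆 has a well-known closed-form solution: the slope is the ratio of the centred cross-products to the centred sum of squares of 𝑋, and the intercept follows from the sample means. The sketch below illustrates this; the function name `fit_simple_ols` and the five data points are hypothetical and are not part of the Advertising data.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0_hat, beta1_hat, rss) minimizing RSS for y = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centred cross-products and centred sum of squares of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    # Residuals e_i = y_i - y_hat_i and their sum of squares.
    residuals = [y - (beta0_hat + beta1_hat * x) for x, y in zip(xs, ys)]
    rss = sum(e ** 2 for e in residuals)
    return beta0_hat, beta1_hat, rss


# Hypothetical data chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
beta0_hat, beta1_hat, rss = fit_simple_ols(xs, ys)
print(f"beta0_hat = {beta0_hat:.2f}, beta1_hat = {beta1_hat:.2f}, RSS = {rss:.3f}")
```

Any other choice of intercept and slope would give a larger 𝑅𝑆𝑆 on the same data.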
Summary of Regression Analysis of Sales on TV
Regression Equation
$$\widehat{Sales} = 7.0325 + 0.0475 \times TV$$
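The fitted equation can be evaluated directly to obtain predictions. The sketch below assumes, consistent with the interpretations given in this deck, that TV is measured in thousands of dollars and sales in thousands of units; the function name `predict_sales` and the budget of 100 are hypothetical.

```python
# Coefficients taken from the fitted regression equation of Sales on TV.
INTERCEPT = 7.0325
SLOPE = 0.0475


def predict_sales(tv_budget: float) -> float:
    """Predicted average sales (thousands of units) for a TV budget
    given in thousands of dollars (unit convention is an assumption)."""
    return INTERCEPT + SLOPE * tv_budget


# Hypothetical budget: compare predictions one unit of TV ($1,000) apart.
baseline = predict_sales(100.0)
bumped = predict_sales(101.0)
extra_units = (bumped - baseline) * 1000  # convert thousands of units to units
print(f"baseline = {baseline:.4f}, extra units per $1,000 = {extra_units:.1f}")
```

The difference between the two predictions equals the slope, which is why the slope is read as the change in average sales per additional $1,000 of TV advertising.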
Interpretations
• An increase of $1,000 in the TV advertising budget is associated with an increase
in average sales of around 48 units.
• Accuracy of the estimated coefficients can be measured from their respective
standard errors.
• Standard errors can be used to construct the confidence intervals.
• A 95% confidence interval is defined as the range of values such that with 95%
probability, the range will contain the true unknown value of the parameter.
• For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form
$$\hat{\beta}_1 \pm t_{0.025,\,n-2} \times SE(\hat{\beta}_1).$$
• That is, there is approximately a 95% chance that the interval
$$\left[\hat{\beta}_1 - t_{0.025,\,n-2} \times SE(\hat{\beta}_1),\ \hat{\beta}_1 + t_{0.025,\,n-2} \times SE(\hat{\beta}_1)\right]$$
will contain the true value of $\beta_1$.
• Similarly, the confidence interval for $\beta_0$ approximately takes the form
$$\hat{\beta}_0 \pm t_{0.025,\,n-2} \times SE(\hat{\beta}_0).$$
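The interval formulas above amount to "estimate plus or minus critical value times standard error". The sketch below illustrates that arithmetic. As an assumption made to keep the example within the Python standard library, it uses the standard normal quantile as a large-sample stand-in for the critical value of the 𝑡 distribution with 𝑛 − 2 degrees of freedom; for small 𝑛 the exact 𝑡 quantile should be used instead. The function name `confidence_interval` and the numeric inputs are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate: float, std_error: float, critical_value: float):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    margin = critical_value * std_error
    return estimate - margin, estimate + margin


# Large-sample approximation (an assumption): the normal 97.5% quantile
# stands in for the t critical value with n - 2 degrees of freedom.
z_crit = NormalDist().inv_cdf(0.975)

# Hypothetical coefficient estimate and standard error.
lower, upper = confidence_interval(estimate=2.0, std_error=0.5, critical_value=z_crit)
print(f"critical value = {z_crit:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Passing the critical value as a parameter keeps the function usable with an exact 𝑡 quantile obtained from a table or a statistics library.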
• In the case of the advertising data, the 95% confidence interval for 𝛽0 is
[6.130,7.935] and the 95% confidence interval for 𝛽1 is [0.042,0.053].
• Therefore, we can conclude that in the absence of any advertising, sales will, on
average, fall somewhere between 6,130 and 7,935 units.
• Furthermore, for each $1,000 increase in television advertising, there will be an
average increase in sales of between 42 and 53 units.
• Standard errors can also be used to perform hypothesis tests on the coefficients.
• The most common hypothesis test involves testing the null hypothesis
H0: There is no relationship between X and Y
versus the alternative
H1: There is some relationship between X and Y.
• Mathematically, this corresponds to testing
$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0.$$
• The test statistic for testing the hypothesis is given by
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}.$$
• Under $H_0$, the above test statistic follows a $t$-distribution with $n - 2$ degrees of freedom.
• We reject the null hypothesis if the $p$-value is small enough.
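The test for the slope can be sketched end to end: compute the 𝑡 statistic, then the two-sided 𝑝-value from the 𝑡 distribution with 𝑛 − 2 degrees of freedom. To stay within the Python standard library, this sketch obtains the 𝑝-value by numerically integrating the 𝑡 density with Simpson's rule, which is an illustrative choice rather than how statistical software usually does it. All function names and numeric inputs are hypothetical.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t: float, df: int, steps: int = 2000) -> float:
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3  # approximates P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * area)


def slope_t_test(beta1_hat: float, se_beta1: float, n: int):
    """Return (t, p_value) for H0: beta1 = 0 against H1: beta1 != 0."""
    df = n - 2
    t = (beta1_hat - 0.0) / se_beta1
    return t, two_sided_p_value(t, df)


# Hypothetical estimate, standard error, and sample size.
t_stat, p_value = slope_t_test(beta1_hat=1.2, se_beta1=0.5, n=12)
print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.4f}")
```

With a conventional 5% significance level, the null hypothesis would be rejected whenever the returned `p_value` falls below 0.05.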
• The coefficients are very large relative to their standard errors, so the
𝑡-statistics are also large.
• The probabilities of seeing such values if 𝐻0 is true are virtually zero. Hence we
can conclude that 𝛽0 ≠ 0 and 𝛽1 ≠ 0.
• This clearly suggests that there is a significant relationship between TV and
sales.
Assessing the Accuracy of the Model
• Once the null hypothesis is rejected in favour of the alternative hypothesis, it is
natural to quantify the extent to which the model fits the data.
• The quality of a linear regression fit is assessed using two related quantities:
➢ Residual Standard Error or Standard Error of the Estimate
➢ 𝑅2 statistic
Residual Standard Error (𝑅𝑆𝐸)
• The $RSE$ is an estimate of the standard deviation of $\epsilon$.
• Roughly, it is the average amount by which the response will deviate from the
true regression line.
• It is computed using the formula
$$RSE = \sqrt{\frac{1}{n-2}\,RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.$$
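The 𝑅𝑆𝐸 can be computed from the observed responses and the fitted values alone. The following is a minimal sketch; the function name `residual_standard_error` and the observed and fitted values are hypothetical and are not taken from the Advertising data.

```python
import math


def residual_standard_error(ys, y_hats) -> float:
    """Return sqrt(RSS / (n - 2)) for a simple linear regression fit."""
    n = len(ys)
    if n != len(y_hats) or n <= 2:
        raise ValueError("need matching lengths and n > 2")
    # Residual sum of squares: sum of squared gaps between observed and fitted.
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    # Divide by n - 2 because two coefficients were estimated.
    return math.sqrt(rss / (n - 2))


# Hypothetical observed responses and fitted values.
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
y_hats = [2.18, 4.14, 6.10, 8.06, 10.02]
rse = residual_standard_error(ys, y_hats)
print(f"RSE = {rse:.4f}")
```

The result is in the same units as the response, so it can be compared with the typical size of the response values.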