Regression I

This document discusses linear regression analysis of advertising data to determine the relationship between advertising budgets (TV, radio, newspaper) and sales. It contains a dataset with advertising expenditures and sales for 200 markets. Key points discussed include: - Performing correlation analysis to see if relationships exist between each ad medium and sales. Significant positive correlations were found. - Using simple linear regression to model the relationship between TV ad spending and sales, which allows estimating the effect of TV on sales and predicting future sales based on TV budget. - Interpreting the regression coefficients and assessing the accuracy and significance of the linear regression model.

Uploaded by

Srijan

Linear Regression

Advertising Data Set


The Advertising data set consists of the sales (in thousands of units) of a particular product in 200 different markets.

It also contains the advertising budgets (in thousands of dollars) for the product in each of the markets for three different media: TV, Radio, and Newspaper.
Advertising Data Set
Market ID   TV      Radio   Newspaper   Sales
1           230.1   37.8    69.2        22.1
2           44.5    39.3    45.1        10.4
3           17.2    45.9    69.3        9.3
4           151.5   41.3    58.5        18.5
5           180.8   10.8    58.4        12.9
6           8.7     48.9    75.0        7.2
7           57.5    32.8    23.5        11.8
8           120.2   19.6    11.6        13.2
9           8.6     2.1     1.0         4.8
Important Questions for an Effective Market Plan

1. Is there any relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict the future sales?
Correlation Analysis
[Figure: scatter plots of the Advertising data set, Sales against TV, Radio, and Newspaper budgets, with sample correlations 𝑟 = 0.78, 𝑟 = 0.58, and 𝑟 = 0.23 respectively.]
Testing for Significant Correlation
• We need to be able to determine whether the relationship implied
by the sample correlation coefficient is real or due to chance.

• In other words, we would like to test whether the population correlation coefficient $\rho_{XY}$ between two variables $X$ and $Y$ is different from zero, i.e.,

$$H_0 : \rho_{XY} = 0 \quad \text{vs} \quad H_1 : \rho_{XY} \neq 0.$$
Test Statistic
• The test statistic is

$$t = \frac{r_{XY}}{s_r},$$

where $t$ follows a $t$ distribution with $df = n - 2$ under $H_0$ and

$$s_r = \sqrt{\frac{1 - r_{XY}^2}{n - 2}}$$

is the standard error of $r_{XY}$.
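The test statistic above can be evaluated directly from a sample correlation and a sample size. The following is a minimal sketch in plain Python; the function name `correlation_t_statistic` is a hypothetical helper, not something from the slides. Because the correlation fed in here is rounded to two decimals, the result will only approximately match the statistic reported for TV and Sales.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0, computed as t = r / s_r,
    where s_r = sqrt((1 - r^2) / (n - 2))."""
    if n <= 2:
        raise ValueError("need more than two observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    s_r = math.sqrt((1.0 - r * r) / (n - 2))  # standard error of r
    return r / s_r


# Rounded correlation between TV and Sales, with n = 200 markets.
t_tv = correlation_t_statistic(0.78, 200)
print(f"t = {t_tv:.2f} with df = {200 - 2}")
```

The statistic is then compared with a $t$ distribution with $n - 2$ degrees of freedom to obtain the $p$-value.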
Correlation Analysis for Advertising Data
Variables             Correlation   Obs. Test Statistic   𝒑-value   Decision
TV and Sales          0.78          17.67                 0.00      Reject 𝐻0
Radio and Sales       0.58          9.92                  0.00      Reject 𝐻0
Newspaper and Sales   0.23          3.30                  0.00      Reject 𝐻0

Thus, we can conclude that there is indeed a significant linear relationship between TV Ad Budget and Sales, Radio Ad Budget and Sales, and Newspaper Ad Budget and Sales.
Limitations of Correlation Analysis
• The correlation coefficient captures only a linear relationship.
• The correlation coefficient may not be a reliable measure in the
presence of outliers.
• Even if two variables are highly correlated, one does not necessarily
cause the other.
Regression Analysis
• With regression analysis, we explicitly assume how one variable,
called the response variable, is influenced by other variables, called
the explanatory variables.
• Using regression analysis, we may predict the response variable
given values for our explanatory variables.
Inexact Relationship
• If the value of the response variable is uniquely determined by the values of the
explanatory variables, we say that the relationship is deterministic.
• But if, as in most fields of research, the response is not fully determined by the explanatory variables (for example, because relevant factors are omitted), we say that the relationship is inexact.
• In regression analysis, we include a stochastic error term that acknowledges that the actual relationship between the response and explanatory variables is not deterministic.
Notation
• Output Variable/ Response Variable/ Dependent Variable
➢Sales (𝑌)
• Input Variables/ Predictors/ Independent Variables/ Features/ Variables
➢TV budget (𝑋1)
➢Radio budget (𝑋2)
➢Newspaper budget (𝑋3)
Simple Linear Regression
Simple Linear Regression (SLR)
• Single Predictor (𝑋)
• The SLR model is given by
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖,    (1)
where 𝜖 is a mean-zero random error term.
• For example, let 𝑋 represent TV advertising budget and 𝑌 represent sales, then
the SLR model is given by
Sales = 𝛽0 + 𝛽1 × TV + 𝜖.
Simple Linear Regression (SLR)
• In Equation (1), 𝛽0 and 𝛽1 are two unknown constants that
represent the intercept and slope terms in the linear model.
• Together they are called the model coefficients or parameters.
How to Get the Estimates of the Coefficients?
• In the advertising data set, we have data on the TV advertising budget and
product sales in 𝑛 = 200 different markets.
• Our job is to find the coefficients in such a way that the resulting line is as close
as possible to the 𝑛 = 200 data points.
• How should we measure the closeness?
• The most common approach involves minimizing the least squares criterion.
How to Get the Estimates of the Coefficients?

• Source: http://www-bcf.usc.edu/~gareth/ISL/
How to Get the Estimates of the Coefficients?
• Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
• Let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be the prediction for $Y$ based on the $i$-th value of $X$.
• Residual: $e_i = y_i - \hat{y}_i$.
• We define the residual sum of squares ($RSS$), also called the sum of squares due to errors (SSE), as
$$RSS = e_1^2 + \cdots + e_n^2.$$
• We minimize $RSS$ to get the estimates of the coefficients.
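As an illustration of minimizing $RSS$, here is a minimal sketch in plain Python. It uses the standard closed-form least squares solution for a single predictor, which the slides do not state explicitly, and it uses only the nine (TV, Sales) rows shown in the data table earlier. Since that is 9 of the 200 markets, the estimates it prints are not expected to match the full-data results. The function names are hypothetical.

```python
# Nine (TV, Sales) rows copied from the data table; the full set has 200 markets.
tv = [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6]
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]


def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize RSS for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


def residual_sum_of_squares(x, y, b0, b1):
    """RSS = e_1^2 + ... + e_n^2 with e_i = y_i - (b0 + b1 * x_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_simple_linear_regression(tv, sales)
rss_min = residual_sum_of_squares(tv, sales, b0, b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, RSS = {rss_min:.4f}")
```

Moving either coefficient away from the fitted values can only increase the RSS, which is a quick way to sanity-check the fit.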
Summary of Regression Analysis of Sales on TV

            Coefficient   Std. error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         <0.0001
TV          0.0475        0.0027       17.67         <0.0001
Regression Equation
$$\widehat{\text{Sales}} = 7.0325 + 0.0475 \times \text{TV}$$
Interpretations
• An increase of $1000 in the TV advertising budget is associated with an increase
in average sales by around 48 units.
• Accuracy of the estimated coefficients can be measured from their respective
standard errors.
• Standard errors can be used to construct the confidence intervals.
• A 95% confidence interval is defined as the range of values such that with 95%
probability, the range will contain the true unknown value of the parameter.
Interpretations
• For linear regression, the 95% confidence interval for $\beta_1$ approximately takes the form
$$\hat{\beta}_1 \pm t_{0.025,\,n-2} \times SE(\hat{\beta}_1).$$
• That is, there is approximately a 95% chance that the interval
$$\left[\hat{\beta}_1 - t_{0.025,\,n-2} \times SE(\hat{\beta}_1),\; \hat{\beta}_1 + t_{0.025,\,n-2} \times SE(\hat{\beta}_1)\right]$$
will contain the true value of $\beta_1$.
• Similarly, the confidence interval for $\beta_0$ approximately takes the form
$$\hat{\beta}_0 \pm t_{0.025,\,n-2} \times SE(\hat{\beta}_0).$$
Interpretations
• In the case of the advertising data, the 95% confidence interval for 𝛽0 is
[6.130,7.935] and the 95% confidence interval for 𝛽1 is [0.042,0.053].
• Therefore, we can conclude that in the absence of any advertising, sales will, on
average, fall somewhere between 6,130 and 7,935 units.
• Furthermore, for each $1,000 increase in television advertising, there will be an
average increase in sales of between 42 and 53 units.
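The intervals quoted above can be reproduced from the coefficient table. The sketch below assumes a critical value $t_{0.025,198} \approx 1.972$, taken from a standard $t$ table rather than from the slides; the helper name `confidence_interval` is hypothetical.

```python
# Assumed critical value t_{0.025, 198} from a standard t table (n = 200, df = n - 2).
T_CRIT = 1.972


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) = estimate -/+ t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors from the regression of Sales on TV.
ci_b0 = confidence_interval(7.0325, 0.4578)
ci_b1 = confidence_interval(0.0475, 0.0027)

print(f"beta_0: [{ci_b0[0]:.3f}, {ci_b0[1]:.3f}]")
print(f"beta_1: [{ci_b1[0]:.3f}, {ci_b1[1]:.3f}]")
```

Sales are recorded in thousands of units and budgets in thousands of dollars, so an interval for $\beta_1$ of roughly 0.042 to 0.053 corresponds to 42 to 53 units per additional $1,000 of TV advertising.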
Interpretations
• Standard Errors can also be used to perform hypothesis tests on the
coefficients.
• The most common hypothesis test involves testing the null hypothesis of
𝐻0 ∶ 𝑇ℎ𝑒𝑟𝑒 𝑖𝑠 𝑛𝑜 𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑋 𝑎𝑛𝑑 𝑌
versus the alternative
𝐻1 ∶ 𝑇ℎ𝑒𝑟𝑒 𝑖𝑠 𝑠𝑜𝑚𝑒 𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛𝑠ℎ𝑖𝑝 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑋 𝑎𝑛𝑑 𝑌.
• Mathematically, this corresponds to testing

𝐻0 ∶ 𝛽1 = 0 𝑣𝑠 𝐻1 ∶ 𝛽1 ≠ 0.
Interpretations
• The test statistic for testing the hypothesis is given by
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}.$$
• Under 𝐻0 , the above test statistic follows a 𝑡-distribution with 𝑛 − 2 degrees of
freedom.
• We reject the null hypothesis if the 𝑝-value is small enough.
Interpretations
• The coefficients are very large relative to their standard errors, so the 𝑡-
statistics are also large.
• The probabilities of seeing such values if 𝐻0 is true are virtually zero. Hence we
can conclude that 𝛽0 ≠ 0 and 𝛽1 ≠ 0.
• This clearly suggests that there is a significant relationship between TV and
sales.
Assessing the Accuracy of the Model
• Once the null hypothesis is rejected in favour of the alternative hypothesis, it is
natural to quantify the extent to which the model fits the data.
• The quality of a linear regression fit is assessed using two related quantities:
➢ Residual Standard Error or Standard Error of the Estimate
➢ 𝑅2 statistic
Residual Standard Error (𝑅𝑆𝐸)
• The 𝑅𝑆𝐸 is an estimate of the standard deviation of 𝜖.
• Roughly, it is the average amount by which the response will deviate from the true regression line.
• It is computed using the formula
$$RSE = \sqrt{\frac{1}{n-2}\,RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$$
• The 𝑅𝑆𝐸 is considered a measure of the lack of fit.


Residual Standard Error (𝑅𝑆𝐸)
• In case of the advertising data, the RSE is 3.26.
• This means that the actual sales in each market deviate from the true regression line by
approximately 3,260 units, on average.
• Even if the model were correct and the true values of the unknown coefficients 𝛽0 and
𝛽1 were known exactly, any prediction of sales on the basis of TV advertising would still
be off by about 3,260 units on average.
• The next question is whether or not 3,260 units is an acceptable prediction error.
Residual Standard Error (𝑅𝑆𝐸)
• In the advertising data set, the mean value of sales over all markets is
approximately 14,000 units.
• The percentage error is 3,260/14,000 ≈ 23%.
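The RSE computation and the percentage error can be sketched as follows. The function `residual_standard_error` is a hypothetical helper that takes observed and fitted values; the toy numbers in the first call are made up purely to show the arithmetic, while the last lines use the RSE and mean sales quoted in the slides.

```python
import math


def residual_standard_error(y, y_hat, n_coefficients=2):
    """RSE = sqrt(RSS / (n - 2)) for simple linear regression,
    where two coefficients (intercept and slope) are estimated."""
    n = len(y)
    if n <= n_coefficients:
        raise ValueError("not enough observations")
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(rss / (n - n_coefficients))


# Toy illustration (made-up values): RSS = 4, n - 2 = 2, so RSE = sqrt(2).
toy_rse = residual_standard_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(f"toy RSE = {toy_rse:.4f}")

# Values quoted in the slides, both in thousands of units.
reported_rse = 3.26
mean_sales = 14.0
percentage_error = reported_rse / mean_sales
print(f"percentage error = {percentage_error:.1%}")
```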
𝑅² Statistic
• The RSE provides an absolute measure of lack of fit of the model to the data.
• Since it is measured in the units of 𝑌, it is not always clear what constitutes a good RSE.
• The 𝑅² statistic provides an alternative measure of fit.
• It takes the form of a proportion (the proportion of variance explained), so it always takes on a value between 0 and 1 and is independent of the scale of 𝑌.
𝑅² Statistic
• To calculate 𝑅², we use the formula
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS},$$
where $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares and $\bar{y}$ is the sample mean of the response.
• 𝑇𝑆𝑆 measures the amount of variability inherent in the response before the regression is performed.
• In contrast, 𝑅𝑆𝑆 measures the amount of variability that is left unexplained after performing the regression.
• Hence, 𝑇𝑆𝑆 − 𝑅𝑆𝑆 measures the amount of variability in the response that is explained (or removed) by performing the regression, and 𝑅² measures the proportion of variability in 𝑌 that can be explained using 𝑋.
𝑅² Statistic
• An 𝑅² statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
• A number near 0 indicates that the regression did not explain much of the variability in the response.
• This might occur because the linear model is wrong, or the inherent error 𝜎² is high, or both.
• In the advertising data set, the 𝑅² was 0.61, and so just under two-thirds of the variability in sales is explained by a linear regression on TV alone.
𝑅² Statistic
• The 𝑅² statistic has an interpretational advantage over the RSE, since unlike the RSE, it always lies between 0 and 1.
• However, it can still be challenging to determine what is a good 𝑅² value, and in general, this will depend on the application.
• In the simple linear regression setting, 𝑅² = 𝑟².
• Thus 𝑅² can work as a measure of the linear relationship between 𝑋 and 𝑌.
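The identity between 𝑅² and the squared sample correlation in simple linear regression can be checked numerically. This is a minimal sketch in plain Python using only the nine (TV, Sales) rows from the data table, so the value it prints is not the 0.61 reported for all 200 markets. The function names are hypothetical, and the least squares fit uses the standard closed-form solution.

```python
import math

# Nine (TV, Sales) rows copied from the data table; the full set has 200 markets.
tv = [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6]
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]


def fit_line(x, y):
    """Least squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS, with TSS measured around the mean of y."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1.0 - rss / tss


def pearson_r(x, y):
    """Sample correlation coefficient between x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


b0, b1 = fit_line(tv, sales)
fitted = [b0 + b1 * xi for xi in tv]
r2 = r_squared(sales, fitted)
r = pearson_r(tv, sales)
print(f"R^2 = {r2:.4f}, r^2 = {r * r:.4f}")
```

The two printed values agree up to floating-point error, which is the identity the last two bullets rely on.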
Summary of Regression Analysis of Sales on Radio

            Coefficient   Std. error   t-statistic   p-value
Intercept   9.312         0.563        16.54         <0.0001
Radio       0.203         0.020        9.92          <0.0001
Summary of Regression Analysis of Sales on Newspaper

            Coefficient   Std. error   t-statistic   p-value
Intercept   12.351        0.621        19.88         <0.0001
Newspaper   0.055         0.017        3.30          0.0012
Summary of Three Simple Linear Regression Models

Model   Predictors   𝑹²     𝑹𝑺𝑬
1       TV           0.61   3.26
2       Radio        0.33   4.28
3       Newspaper    0.05   5.09
Fitted Regression Lines
