AS lecture 06 (Simple linear Regression)

Uploaded by

amiraziz.uet
Copyright
© © All Rights Reserved

SIMPLE LINEAR REGRESSION
Lecture # 06

Dr. Imran Khalil


Contents
• Simple Linear Regression
• Two Main Objectives
• Variable’s Roles
• A Linear Equation
• How to Perform Simple Linear Regression
• Regression Analysis
• 𝑡-statistics or 𝑡 ratio
• Coefficient of Determination (𝑅²)
• Coefficient of Correlation (𝑟)
• Practice Problems

Simple Linear Regression
• Simple linear regression is used to estimate the relationship
between two quantitative variables. You can use simple linear
regression when you want to know:
• How strong the relationship is between two variables (e.g., the
relationship between rainfall and soil erosion).
• The value of the dependent variable at a certain value of
the independent variable (e.g., the amount of soil erosion at a
certain level of rainfall).
• Regression models describe the relationship between variables by
fitting a line to the observed data. Linear regression models use a
straight line.
• Regression allows you to estimate how a dependent variable changes as
the independent variable(s) change.

Simple Linear Regression Example
• You are a social researcher interested in the relationship between
income and happiness. You survey 500 people whose incomes
range from 15k to 75k and ask them to rank their happiness on a
scale from 1 to 10.
• Your independent variable (income) and dependent variable
(happiness) are both quantitative, so you can do a regression
analysis to see if there is a linear relationship between them.

Two Main Objectives
• Establish if there is a relationship between two variables.
• More specifically, establish if there is a statistically significant
relationship between two variables.
• Examples: Income and spending, wage and gender, student height and
exam scores.
• Forecast new observations.
• Can we use what we know about the relationship to forecast unobserved
values?
• Examples: What will our sales be over the next quarter? What will be the
effect of advertising on sales?

Variable’s Roles
Dependent Variable
• This is the variable whose values we want to explain or forecast.
• Its values depend on something else.
• We denote it as 𝒚.
Independent Variable
• This is the variable that explains the other one.
• Its values are independent.
• We denote it as 𝒙.

A Linear Equation
• You may remember one of these:
• 𝑦 = 𝑎 + 𝑏𝑥
• 𝑦 = 𝑚𝑥 + 𝑏
• In the stats world, we just use a different notation:
• 𝑦 = 𝛽0 + 𝛽1 𝑥
• We call it “linear” because the equation represents a straight line

How to Perform a Simple Linear Regression
Simple Linear Regression Formula
𝒚 = 𝒂 + 𝒃𝒙
• 𝒚 is the predicted value of the dependent variable for the given
value of the independent variable 𝑥.
• 𝒂 is the intercept, the predicted value of 𝑦 when 𝑥 is 0.
• 𝒃 is the slope or regression coefficient: how much we expect 𝑦 to
change for each one-unit increase in 𝑥.
• 𝒙 is the independent variable (the variable we expect is
influencing 𝑦)

Regression Analysis: Example
• Suppose that a manager wants to determine the relationship
between the firm’s advertising expenditures and its sales revenue.
The manager wants to test the hypothesis that higher advertising
expenditures lead to higher sales for the firm, and, furthermore,
she wants to estimate the strength of the relationship (i.e., how
much sales increase for each dollar increase in advertising
expenditures).
• The manager collects data on advertising expenditures and on
sales revenue for the firm over the past 10 years.

Advertising Expenditures and Sales Revenues of the Firm in Each of 10 Years
Year   Advertising (X, $ million)   Sales (Y, $ million)
1      10                           44
2      9                            40
3      11                           42
4      12                           46
5      11                           48
6      12                           52
7      13                           54
8      13                           58
9      14                           56
10     15                           60
[Figure: scatter diagram of sales revenue (Y) against advertising expenditures (X)]

𝐻0 and 𝐻𝐴 Hypotheses
• Null hypothesis (𝑯𝟎): There is no relationship between X (advertising) and Y (sales).
• Alternative hypothesis (𝑯𝑨): There is a significant relationship between X (advertising) and Y (sales).

Regression Analysis
Regression Line: Line of Best Fit
By visual inspection, draw the positively sloped straight line that “best” fits
the data points, so that the points are about equally distant on either side of
the line.

Least Squares Method
• The fitted line gives a rough estimate of the linear relationship between the firm’s sales
revenue (𝑌) and its advertising expense (𝑋) in the form
𝒀 = 𝒂 + 𝒃𝑿
• 𝒂 is the vertical intercept of the estimated linear relationship and
gives the value of 𝑌 when 𝑋 = 0, while
• 𝒃 is the slope of the line and gives an estimate of the increase in 𝑌
resulting from each unit increase in 𝑋.

Problem – Best Fit Line
• The difficulty with the visual fitting of the line to the data
points in the figure is that different researchers would
probably fit a somewhat different line to the same data points
and obtain somewhat different results.
• Regression analysis is a statistical technique for obtaining the
line that best fits the data points so that all researchers
looking at the same data would get exactly the same result.
• The regression line is the line obtained by minimizing the sum of
the squared vertical deviations of each point from the
line. This method is, therefore, appropriately called
the “ordinary least squares” method.
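The least-squares criterion described above can be written out explicitly. The following is a standard derivation sketch rather than something taken from the slides; \(S(a, b)\) is a name introduced here for the sum of squared vertical deviations.

```latex
\begin{aligned}
S(a, b) &= \sum_{t=1}^{n} e_t^2 = \sum_{t=1}^{n} (Y_t - a - bX_t)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{t=1}^{n} (Y_t - a - bX_t) = 0
  \;\Rightarrow\; a = \bar{Y} - b\bar{X} \\
\frac{\partial S}{\partial b} &= -2 \sum_{t=1}^{n} X_t (Y_t - a - bX_t) = 0
  \;\Rightarrow\; b = \frac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}
\end{aligned}
```

Because these two conditions have a single solution whenever the \(X_t\) are not all equal, every researcher applying them to the same data obtains the same line.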

Ordinary Least Squares Method (OLS)

\( \hat{Y}_t = \hat{a} + \hat{b} X_t \)
\( e_t = Y_t - \hat{Y}_t \)
• \( Y_t \) = the actual or observed sales revenue.
• \( \hat{Y}_t \) = the sales revenue of the firm estimated from the regression line.
• \( e_t \) = the vertical deviation (error) of the actual or observed sales revenue from the estimated value.

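Simple Linear Regression Formula
\( \hat{Y} = a + bX \)
• \( b = \dfrac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2} \)
• \( a = \bar{Y} - b\bar{X} \)
• \( \bar{Y} = \dfrac{\sum_{t=1}^{n} Y_t}{n} \)
• \( \bar{X} = \dfrac{\sum_{t=1}^{n} X_t}{n} \)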

Advertising and Sales Revenues of the Firm in Each of 10 Years
Year   Sales (Y)   Advertising Expense (X)   X − X̄   Y − Ȳ   (X − X̄)(Y − Ȳ)   (X − X̄)²
1      44          10                        -2      -6      12               4
2      40          9                         -3      -10     30               9
3      42          11                        -1      -8      8                1
4      46          12                        0       -4      0                0
5      48          11                        -1      -2      2                1
6      52          12                        0       2       0                0
7      54          13                        1       4       4                1
8      58          13                        1       8       8                1
9      56          14                        2       6       12               4
10     60          15                        3       10      30               9
Σ      500         120                                       106              30

Advertising and Sales Revenues of the Firm in Each of 10 Years
From the table: \( \sum Y_t = 500 \), \( \sum X_t = 120 \), \( n = 10 \).
• \( \bar{Y} = \dfrac{\sum_{t=1}^{n} Y_t}{n} = \dfrac{500}{10} = 50 \)
• \( \bar{X} = \dfrac{\sum_{t=1}^{n} X_t}{n} = \dfrac{120}{10} = 12 \)

Advertising and Sales Revenues of the Firm in Each of 10 Years
From the table: \( \sum (X_t - \bar{X})(Y_t - \bar{Y}) = 106 \) and \( \sum (X_t - \bar{X})^2 = 30 \), with \( \bar{Y} = 50 \) and \( \bar{X} = 12 \).
• \( b = \dfrac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2} = \dfrac{106}{30} = 3.533 \)
• \( a = \bar{Y} - b\bar{X} = 50 - 3.533(12) = 7.60 \)
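The slope and intercept calculation above can be reproduced in a few lines. This is a minimal sketch using only the ten observations from the lecture's table; the variable names (`s_xy`, `s_xx`, and so on) are chosen here for illustration and do not come from the slides.

```python
# Data from the lecture's advertising/sales table (in $ millions).
X = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]   # advertising expense
Y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]   # sales revenue

n = len(X)
x_bar = sum(X) / n   # mean of X
y_bar = sum(Y) / n   # mean of Y

# Sum of cross-products of deviations, and sum of squared deviations of X.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept

print(f"x_bar={x_bar}, y_bar={y_bar}, b={b:.3f}, a={a:.2f}")
```

The unrounded slope is 106/30; the slides round it to 3.53 for the later calculations.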

Ordinary Least Squares Method (OLS)
\( \hat{Y}_t = a + bX_t = 7.60 + 3.53X_t \)
• This regression line indicates that with zero advertising expenditures (i.e., with \( X_t = 0 \)), the expected sales revenue of the firm \( \hat{Y}_t \) is $7.60 million:
\( \hat{Y}_t = 7.60 + 3.53(0) = 7.60 \) ($ million)
• With advertising of $10 million, as in the first observation year (\( X_1 = 10 \)):
\( \hat{Y}_1 = 7.60 + 3.53(10) = 42.90 \) ($ million)
• On the other hand, with \( X_{10} = 15 \) ($ million):
\( \hat{Y}_{10} = 7.60 + 3.53(15) = 60.55 \) ($ million)

Ordinary Least Squares Method (OLS)
Plotting these last two points, (10, 42.90) and (15, 60.55), and joining them by a straight line, we obtain the regression line.
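As a quick check of the fitted values quoted in the lecture, the sketch below evaluates the estimated line at X = 0, 10, and 15. The function name `predict_sales` is hypothetical, and the default coefficients are the rounded values 7.60 and 3.53 from the slides.

```python
def predict_sales(advertising, a=7.60, b=3.53):
    """Return estimated sales ($ million) from the lecture's rounded coefficients."""
    return a + b * advertising

# Evaluate the line at zero advertising and at the two points used for plotting.
points = [(x, round(predict_sales(x), 2)) for x in (0, 10, 15)]
print(points)
```

Any two of these points are enough to draw the line, since a straight line is fully determined by two points.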

𝒕-statistics or 𝒕 ratio
• The 𝑡-test in linear regression helps you decide whether to reject or fail to reject
the null hypothesis about the effect of an individual predictor variable on the
dependent variable.

\( t = \dfrac{b}{S_b} \), where \( S_b \) is the standard error of \( b \).

• The higher this calculated 𝑡 ratio is, the more confident we can be that there is a
significant relationship between 𝑋 (advertising) and 𝑌 (sales).

Tests of Significance
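• To test the hypothesis that 𝑏 is statistically significant (i.e., that
advertising positively affects sales), we first need to calculate
the standard error of 𝑏:
\( S_b = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{(n - k)\sum (X_t - \bar{X})^2}} = \sqrt{\dfrac{\sum e_t^2}{(n - k)\sum (X_t - \bar{X})^2}} \)
• Here 𝑛 is the number of observations and 𝑘 is the number of estimated coefficients (𝑘 = 2, for 𝑎 and 𝑏).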

Advertising and Sales Revenues of the Firm in Each of 10 Years
Year   Sales (Y)   Advertising Expense (X)   X − X̄   (X − X̄)²   Ŷ = 7.60 + 3.53X   e = Y − Ŷ   e²
1      44          10                        -2      4          42.90              1.10        1.2100
2      40          9                         -3      9          39.37              0.63        0.3969
3      42          11                        -1      1          46.43              -4.43       19.6249
4      46          12                        0       0          49.96              -3.96       15.6816
5      48          11                        -1      1          46.43              1.57        2.4649
6      52          12                        0       0          49.96              2.04        4.1616
7      54          13                        1       1          53.49              0.51        0.2601
8      58          13                        1       1          53.49              4.51        20.3401
9      56          14                        2       4          57.02              -1.02       1.0404
10     60          15                        3       9          60.55              -0.55       0.3025
Σ      500         120                               30                                        65.4830

Tests of Significance
• To test the hypothesis that 𝑏 is statistically significant (i.e., that
advertising positively affects sales), we first calculate
the standard error of 𝑏, with 𝑛 = 10 and 𝑘 = 2:

\( S_b = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{(n - k)\sum (X_t - \bar{X})^2}} = \sqrt{\dfrac{65.483}{(10 - 2)(30)}} = 0.52 \)
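The standard error of the slope can be recomputed from the raw data. This is a minimal sketch that assumes the rounded coefficients from the slides (a = 7.60, b = 3.53), so that the sum of squared errors matches the table; the variable names are illustrative only.

```python
import math

X = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]   # advertising expense
Y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]   # sales revenue

a, b = 7.60, 3.53   # rounded coefficients used on the slides
k = 2               # number of estimated coefficients (a and b)
n = len(X)
x_bar = sum(X) / n

# Residuals e_t = Y_t - Y_hat_t and their sum of squares.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
sse = sum(e ** 2 for e in residuals)
s_xx = sum((x - x_bar) ** 2 for x in X)

# Standard error of the slope.
s_b = math.sqrt(sse / ((n - k) * s_xx))
print(f"SSE={sse:.4f}, S_b={s_b:.2f}")
```

Using the unrounded coefficients instead would change the sum of squared errors slightly, but not the two-decimal value of the standard error.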
𝒕-statistics or 𝒕 ratio
\( t = \dfrac{b}{S_b} = \dfrac{3.53}{0.52} = 6.79 \)

• We compare the calculated 𝑡 ratio to the critical value of the 𝑡
distribution with 𝑛 − 𝑘 = 10 − 2 = 8 degrees of freedom (𝑑𝑓) at the 5% level of
significance.

𝒕-statistics or 𝒕 ratio
• The critical value is 𝒕 = 𝟐.𝟑𝟎𝟔 for a two-tailed 𝑡 test at the 𝟓% level of significance with 𝟖 𝒅𝒇.
• Our calculated value of 𝒕 = 𝟔.𝟕𝟗 exceeds this tabular value:
\( t_{\text{calculated}} > t_{\text{critical}} \)
\( 6.79 > 2.306 \)
• We therefore reject the null hypothesis that there is no relationship between
𝑋 (advertising) and 𝑌 (sales), and
• we accept the alternative hypothesis that there is a significant relationship
between 𝑋 and 𝑌.
• In other words, the relationship is statistically significant at the 5% level (95% confidence).
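The decision rule can be expressed as a short comparison. This sketch takes the slope, its standard error, and the critical value 2.306 directly from the lecture instead of looking the critical value up in a statistics library; the variable names are assumptions made for illustration.

```python
b = 3.53            # estimated slope from the lecture
s_b = 0.52          # standard error of b from the lecture
t_critical = 2.306  # two-tailed, 5% level, 8 df (value quoted in the lecture)

t_ratio = b / s_b
reject_null = abs(t_ratio) > t_critical   # two-tailed test compares |t| to the critical value

print(f"t={t_ratio:.2f}, reject H0: {reject_null}")
```

Taking the absolute value keeps the same rule usable when the estimated slope is negative.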
Coefficient of determination (𝑅²)
• 𝑅² measures how much of the variation in the firm’s sales is
explained by the variation in its advertising expenditures.

\( R^2 = \dfrac{\sum (\hat{Y}_t - \bar{Y})^2}{\sum (Y_t - \bar{Y})^2} \)

Advertising and Sales Revenues of the Firm in Each of 10 Years
Year   Sales (Y)   Advertising Expense (X)   Ŷ = 7.60 + 3.53X   Ŷ − Ȳ    (Ŷ − Ȳ)²   Y − Ȳ   (Y − Ȳ)²
1      44          10                        42.90              -7.10    50.4100    -6      36
2      40          9                         39.37              -10.63   112.9969   -10     100
3      42          11                        46.43              -3.57    12.7449    -8      64
4      46          12                        49.96              -0.04    0.0016     -4      16
5      48          11                        46.43              -3.57    12.7449    -2      4
6      52          12                        49.96              -0.04    0.0016     2       4
7      54          13                        53.49              3.49     12.1801    4       16
8      58          13                        53.49              3.49     12.1801    8       64
9      56          14                        57.02              7.02     49.2804    6       36
10     60          15                        60.55              10.55    111.3025   10      100
Σ      500         120                                                   373.8430           440
Coefficient of determination (𝑅²)
\( R^2 = \dfrac{\sum (\hat{Y}_t - \bar{Y})^2}{\sum (Y_t - \bar{Y})^2} = \dfrac{373.84}{440} = 0.85 \)

This means that 85% of the total variation in the firm’s sales is
accounted for by the variation in the firm’s advertising
expenditures.
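The ratio of explained to total variation can be checked numerically. This is a minimal sketch that again assumes the rounded coefficients from the slides; `explained` and `total` are names introduced here for the two sums of squares.

```python
X = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]   # advertising expense
Y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]   # sales revenue

a, b = 7.60, 3.53                    # rounded coefficients used on the slides
y_bar = sum(Y) / len(Y)
y_hat = [a + b * x for x in X]       # fitted values from the regression line

explained = sum((yh - y_bar) ** 2 for yh in y_hat)   # variation explained by the line
total = sum((y - y_bar) ** 2 for y in Y)             # total variation in Y

r_squared = explained / total
print(f"explained={explained:.2f}, total={total:.0f}, R^2={r_squared:.2f}")
```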

Coefficient of correlation (𝒓)
\( r = \sqrt{R^2} \)

This is simply a measure of the degree of association or covariation that exists
between variables X & Y. For our advertising-sales example,
\( r = \sqrt{R^2} = \sqrt{0.85} = 0.92 \)

This indicates a strong positive linear association between X & Y; \( r \) takes the sign of the slope \( b \), which is positive here.
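For comparison, the correlation coefficient can also be computed directly from the deviations of X and Y. The formula below is the standard Pearson formula, which is not shown on the slides; it is included here as an assumption-labelled cross-check, and it carries the sign of the relationship automatically.

```python
import math

X = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]   # advertising expense
Y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]   # sales revenue

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
s_yy = sum((y - y_bar) ** 2 for y in Y)

# Pearson correlation coefficient (standard formula, not from the slides).
r = s_xy / math.sqrt(s_xx * s_yy)
print(f"r={r:.2f}, r^2={r * r:.2f}")
```

To two decimals this agrees with the lecture's value obtained as the square root of 𝑅².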

Standard Deviation of Regression or
Standard Error of Estimate
• Not all the observed values of (𝑋, 𝑦) fall on the regression line;
they scatter away from it.
• The standard error of estimate is the standard deviation of
the observed values about the regression line.
• The sample standard error of estimate is denoted by \( S_{X,y} \):
\( S_{X,y} = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{n - 2}} \)

Standard Deviation of Regression or
Standard Error of Estimate
\( S_{X,y} = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{n - 2}} = \sqrt{\dfrac{65.4830}{10 - 2}} = 2.86 \)
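The same residuals used for the standard error of the slope give the standard error of estimate. This sketch assumes the rounded coefficients from the slides; the name `see` is an abbreviation introduced here.

```python
import math

X = [10, 9, 11, 12, 11, 12, 13, 13, 14, 15]   # advertising expense
Y = [44, 40, 42, 46, 48, 52, 54, 58, 56, 60]   # sales revenue

a, b = 7.60, 3.53   # rounded coefficients used on the slides

# Sum of squared residuals around the regression line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

# Standard error of estimate with n - 2 degrees of freedom.
see = math.sqrt(sse / (len(X) - 2))
print(f"standard error of estimate = {see:.2f}")
```

The result is in the units of Y, so a typical observation lies roughly this many $ million away from the fitted line.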

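Practice Problem
\( \hat{Y} = a + bX \)
• \( \bar{Y} = \dfrac{\sum_{t=1}^{n} Y_t}{n} \)
• \( \bar{X} = \dfrac{\sum_{t=1}^{n} X_t}{n} \)
• \( b = \dfrac{\sum_{t=1}^{n} (X_t - \bar{X})(Y_t - \bar{Y})}{\sum_{t=1}^{n} (X_t - \bar{X})^2} \)
• \( a = \bar{Y} - b\bar{X} \)
• \( t = \dfrac{b}{S_b} \)
• \( S_b = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{(n - k)\sum (X_t - \bar{X})^2}} \)
• \( R^2 = \dfrac{\sum (\hat{Y}_t - \bar{Y})^2}{\sum (Y_t - \bar{Y})^2} \)
• \( r = \sqrt{R^2} \)
• \( S_{X,y} = \sqrt{\dfrac{\sum (Y_t - \hat{Y}_t)^2}{n - 2}} \)

Year   Sales (Y)   Quality (X)
1      16          5
2      19          6
3      23          8
4      28          10
5      36          12
6      41          13
7      44          15
8      45          16
9      50          17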
Acknowledgment
• [Peter Andrew Bruce] Practical Statistics for Data Scientists
• [David Forsyth] Probability and Statistics for Computer Science
• [Michael Baron] Probability and Statistics for Computer Scientists