Statistics 2
Regression and Correlation
Learning Outcomes
▪ Able to determine the purpose of the observations for data processing
using regression analysis.
▪ Able to identify the observed variables when performing regression
analysis.
Correlation Analysis
SCATTER PLOT AND CORRELATION
• A scatter plot (or scatter diagram) is used to show the relationship
between two variables
• Correlation analysis is used to measure the strength of the association
(linear relationship) between two variables
→ Only concerned with the strength of the relationship
→ No causal effect is implied
SCATTER PLOT EXAMPLES
[Figures omitted: example scatter plots]
CORRELATION COEFFICIENT
✓ The population correlation coefficient ρ measures the strength of
the linear relationship between two variables
✓ The sample correlation coefficient r is an estimate of ρ and is used
to measure the strength of the linear relationship in the sample
observations
FEATURES OF CORRELATION COEFFICIENT
a. Unit free
b. Ranges from -1.00 to +1.00; a value of -1.00 or +1.00 indicates a perfect linear relationship
c. The closer to -1.00, the stronger the negative linear relationship
d. The closer to +1.00, the stronger the positive linear relationship
e. The closer to 0, the weaker the linear relationship
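The "unit free" property can be checked numerically. The sketch below is only an illustration: the height and weight figures are hypothetical, and `pearson_r` is a helper name introduced here, implementing the standard formula r = Sxy / sqrt(Sxx · Syy). Converting both variables to other units should leave r unchanged, and perfectly linear data should give exactly +1 or -1.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data (not from the slides): heights in cm, weights in kg.
heights_cm = [150.0, 160.0, 165.0, 172.0, 180.0]
weights_kg = [52.0, 58.0, 63.0, 70.0, 77.0]
r_metric = pearson_r(heights_cm, weights_kg)

# The same measurements expressed in inches and pounds.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]
r_imperial = pearson_r(heights_in, weights_lb)

# Perfectly linear data reach the boundary values +1 and -1.
r_perfect_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_perfect_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

print(r_metric, r_imperial, r_perfect_pos, r_perfect_neg)
```

Rescaling multiplies Sxy and sqrt(Sxx · Syy) by the same factor, which is why the two r values agree up to floating-point rounding.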
EXAMPLES OF APPROXIMATE r VALUES
[Figure omitted: scatter plots with approximate r values]
CALCULATION OF THE CORRELATION COEFFICIENT
• Sample correlation coefficient:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \]
With:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
Where:
\( r \) : sample correlation coefficient
\( n \) : sample size
\( x_i \) : value of observation i of the independent variable
\( y_i \) : value of observation i of the dependent variable
\( \bar{x} \) : average value of the independent variable
\( \bar{y} \) : average value of the dependent variable
EXAMPLE OF CORRELATION COEFFICIENT CALCULATION
We want to evaluate the relationship between the number of
sales calls and the number of products sold.

Calls   Sales     (x_i - x̄)   (x_i - x̄)²   (y_i - ȳ)   (y_i - ȳ)²   (x_i - x̄)(y_i - ȳ)
20      $30.00    -2          4            $-15.00     $225.00      $30.00
40      $60.00    18          324          $15.00      $225.00      $270.00
20      $40.00    -2          4            $-5.00      $25.00       $10.00
30      $60.00    8           64           $15.00      $225.00      $120.00
10      $30.00    -12         144          $-15.00     $225.00      $180.00
10      $40.00    -12         144          $-5.00      $25.00       $60.00
20      $40.00    -2          4            $-5.00      $25.00       $10.00
20      $50.00    -2          4            $5.00       $25.00       $-10.00
20      $30.00    -2          4            $-15.00     $225.00      $30.00
30      $70.00    8           64           $25.00      $625.00      $200.00

Means: x̄ = 22, ȳ = $45.00
Sums: S_xx = 760, S_yy = $1,850.00, S_xy = $900.00

\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} = \frac{900}{\sqrt{760 \times 1850}} = 0.759 \]
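The table arithmetic can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the variable names are illustrative, and the data are the ten (calls, sales) pairs from the table.

```python
import math

# Data from the example table: number of calls and number of sales.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(calls)
x_bar = sum(calls) / n
y_bar = sum(sales) / n

# Sums of squared deviations and cross-products.
s_xx = sum((x - x_bar) ** 2 for x in calls)
s_yy = sum((y - y_bar) ** 2 for y in sales)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(calls, sales))

# Sample correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"x_bar={x_bar}, y_bar={y_bar}")
print(f"Sxx={s_xx}, Syy={s_yy}, Sxy={s_xy}")
print(f"r={r:.6f}")
```

The printed r agrees with the 0.759 obtained by hand.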
Using Excel features: Data > Data Analysis > Correlation

         Calls      Sales
Calls    1
Sales    0.759014   1

r = 0.759 → strong positive linear relationship between the number of
calls and the number of sales
Note: r does not show any cause-and-effect relationship between the
two variables
SIGNIFICANCE TEST OF THE CORRELATION COEFFICIENT
In the previous example, we found that r = 0.759
→ It is based on a sample of 10 observations
→ What can we conclude about the relationship between the two variables in the population?
\( H_0 : \rho = 0 \) (no correlation)
\( H_1 : \rho \neq 0 \) (correlation exists)
Test statistic:
\[ T = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \]
with n - 2 degrees of freedom
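At significance level α, with v = n - 2 degrees of freedom, the critical region is:
\( T < -t_{\alpha/2,\,v} \) or \( T > t_{\alpha/2,\,v} \)
[Figure omitted: t distribution centered at 0 with a rejection region of area α/2 in each tail (reject H0 below -t_{α/2} and above t_{α/2}; do not reject H0 in between)]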
EXAMPLE OF SIGNIFICANCE TEST OF r
1. State the hypotheses
\( H_0 : \rho = 0 \) (no correlation)
\( H_1 : \rho \neq 0 \) (correlation exists)
2. Define the critical value and critical region:
With α = 0.05 and v = n - 2 = 8 → \( t_{\alpha/2,\,v} = t_{0.025,\,8} = 2.306 \)
Critical region: T < -2.306 or T > 2.306
3. Compute the T-statistic:
\[ T = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{0.759 \sqrt{8}}{\sqrt{1 - 0.759^2}} = 3.30 \]
4. Evaluate the hypothesis
T = 3.30 > t = 2.306 → reject H0
5. Conclusion
At the 5% significance level,
there is a positive correlation between the number of calls and the number of sales in the population
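The five steps can be sketched in code. The Python standard library has no t-distribution quantile function, so the critical value 2.306 is taken from the t table as in step 2 rather than computed; `correlation_t_statistic` is a helper name introduced here for illustration.

```python
import math


def correlation_t_statistic(r, n):
    """T = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Sums from the sales-calls example: Sxy = 900, Sxx = 760, Syy = 1850.
r = 900 / math.sqrt(760 * 1850)
n = 10

# Critical value t_{0.025, 8}, read from a t table (not computed here).
t_critical = 2.306

t_stat = correlation_t_statistic(r, n)

# Two-sided test: reject H0 when T falls in either tail.
reject_h0 = abs(t_stat) > t_critical

print(f"T = {t_stat:.3f}, reject H0: {reject_h0}")
```

With the unrounded r the statistic comes out close to 3.30, well inside the rejection region.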
REGRESSION ANALYSIS
INTRODUCTION TO REGRESSION ANALYSIS
Regression analysis is used to:
▪ Predict the value of a dependent variable based on the value of at least one
independent variable
▪ Explain the impact of changes in an independent variable on the dependent variable
▪ Analyse the causal relationship between the independent and dependent variables
Independent variable:
• the variable we use to explain the dependent variable
• Predictor variable → used to predict the expected value of the dependent variable
Dependent variable:
• the variable we wish to explain
• The variable that is being predicted or estimated
SIMPLE LINEAR REGRESSION MODEL
• Only one independent variable (x) → one regressor
• Relationship between x and y is described by a linear
function
• Changes in y are assumed to be caused by changes in x
TYPES OF REGRESSION MODELS
[Figure omitted]
SIMPLE LINEAR REGRESSION MODEL
A linear relationship between the response Y and the
regressor x has the form:
\[ Y = \alpha + \beta x \]
where \( \alpha \) is the intercept and \( \beta \) is the slope.
However, the relationship between Y and x is not
deterministic → there must be a random component in the
equation that relates the variables.
Thus, the model is:
\[ Y = \alpha + \beta x + \epsilon \]
where \( \epsilon \) is a random variable that is assumed to be
distributed with \( E(\epsilon) = 0 \) and \( \mathrm{Var}(\epsilon) = \sigma^2 \).
Interpretation:
✓ The quantity Y is a random variable because ε is random
✓ The value of the regressor x is not random
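A small simulation can make the two interpretation points concrete. The sketch below is illustrative only: the parameter values (alpha, beta, sigma) are hypothetical, and the errors are drawn from a normal distribution, which is an extra assumption beyond E(ε) = 0 and Var(ε) = σ². The x values stay fixed while Y changes from sample to sample.

```python
import random


def simulate_responses(alpha, beta, sigma, xs, rng):
    """Draw one Y per fixed x from Y = alpha + beta * x + epsilon."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(42)  # seeded so the run is repeatable

# Hypothetical population parameters, chosen only for illustration.
alpha, beta, sigma = 19.0, 1.2, 5.0

# The regressor values are fixed (not random).
xs = [10, 20, 30, 40]

# Two samples at the same x values give different Y values.
sample_1 = simulate_responses(alpha, beta, sigma, xs, rng)
sample_2 = simulate_responses(alpha, beta, sigma, xs, rng)

# Averaging many draws at one fixed x approaches alpha + beta * x,
# because the errors have mean zero.
x0 = 20
draws = [alpha + beta * x0 + rng.gauss(0.0, sigma) for _ in range(20000)]
mean_y_at_x0 = sum(draws) / len(draws)

print(sample_1)
print(sample_2)
print(mean_y_at_x0, alpha + beta * x0)
```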
SIMPLE LINEAR REGRESSION ASSUMPTIONS
✓ Error values (ε) are statistically independent
✓ Error values are normally distributed for any given value of
x and have constant variance
✓ The underlying relationship between the x variable and the
y variable is linear
POPULATION AND SAMPLE REGRESSION MODEL
Population regression line (unknown relationship): \( y = \alpha + \beta x \)
Sample (estimated) regression line: \( \hat{y} = a + b x \)
ESTIMATED REGRESSION MODEL
The sample regression model provides an estimate of the population
regression line
LINEAR REGRESSION MODEL
In regression analysis, the objective is to use the data to position a line
that best represents the relationship between the two variables
→ How do we find the best-fitting line for the data?
The first approach is to use a scatter diagram to position the
line visually
SCATTER DIAGRAM
1. Plot of all (x_i, y_i) pairs
2. Suggests how well the model will fit
How would you draw a line through the points? How do you determine which line 'fits best'?
[Figures omitted: the same scatter diagram with different candidate lines drawn through the points]
LEAST SQUARES PRINCIPLE
• We would like to choose a line that minimizes the error
between the actual data and the line → residual
Error in fit:
Given a set of regression data \( \{(x_i, y_i);\ i = 1, 2, \ldots, n\} \) and a fitted model
\( \hat{y}_i = a + b x_i \), the i-th residual \( e_i \) is given by
\[ e_i = y_i - \hat{y}_i \]
Using the least squares method, we minimize the sum of squares of the
vertical deviations from the points to the line → LEAST SQUARES VALUE
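To see what "fits best" means numerically, the sum of squared residuals can be compared across candidate lines. This is a sketch using the sales-calls data from the correlation example; the first two candidate lines are arbitrary guesses made up for illustration, and the third uses the least squares formulas b = Sxy / Sxx and a = ȳ - b x̄ with the sums from that example.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical deviations e_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Sales-calls data from the correlation example.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

# Least squares line from Sxy = 900, Sxx = 760, x_bar = 22, y_bar = 45.
b_ls = 900 / 760
a_ls = 45 - b_ls * 22

# Candidate lines as (intercept, slope); the first two are arbitrary guesses.
sse_flat = sum_squared_residuals(45.0, 0.0, calls, sales)   # horizontal line at y_bar
sse_guess = sum_squared_residuals(15.0, 1.5, calls, sales)  # a steeper hand-drawn line
sse_ls = sum_squared_residuals(a_ls, b_ls, calls, sales)    # least squares line

print(sse_flat, sse_guess, sse_ls)
```

No other straight line gives a smaller sum of squared residuals on these data than the least squares line.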
LEAST SQUARES EQUATION
Given a set of regression data \( \{(x_i, y_i);\ i = 1, 2, \ldots, n\} \), the least squares estimates a and
b of the regression coefficients α and β are computed from the formulas:
\[ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \]
\[ a = \frac{\sum_{i=1}^{n} y_i - b \sum_{i=1}^{n} x_i}{n} = \bar{y} - b\bar{x} \]
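For readers who want to see where these formulas come from, here is a brief sketch of the standard derivation. The symbol SSE for the sum of squared residuals is notation introduced here, not taken from the slides.

```latex
% Sum of squared residuals as a function of the candidate intercept and slope
\[ \mathrm{SSE}(a, b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \]

% Set both partial derivatives to zero
\[ \frac{\partial\, \mathrm{SSE}}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
   \quad\Longrightarrow\quad n a + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \]
\[ \frac{\partial\, \mathrm{SSE}}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
   \quad\Longrightarrow\quad a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \]

% The first equation gives the intercept; substituting it into the second gives the slope
\[ a = \bar{y} - b \bar{x}, \qquad
   b = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}
     = \frac{S_{xy}}{S_{xx}} \]
```

The last equality uses the identities \( S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} \) and \( S_{xx} = \sum x_i^2 - n\bar{x}^2 \).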
INTERPRETATION OF LEAST SQUARES MODEL
➢ a is the estimated average value of y when the value of x is zero
➢ b is the estimated change in the average value of y as a result of a one-unit
change in x
Note:
The regression equation should generally not be used for points outside the range of
the sample values
SIMPLE LINEAR REGRESSION EXAMPLE
Recall the previous example!
We want to evaluate whether the number of sales calls affects the number of
products sold.

Calls (x_i)   Sales (y_i)   (x_i - x̄)   (x_i - x̄)²   (y_i - ȳ)   (y_i - ȳ)²   (x_i - x̄)(y_i - ȳ)
20            $30.00        -2          4            $-15.00     $225.00      $30.00
40            $60.00        18          324          $15.00      $225.00      $270.00
20            $40.00        -2          4            $-5.00      $25.00       $10.00
30            $60.00        8           64           $15.00      $225.00      $120.00
10            $30.00        -12         144          $-15.00     $225.00      $180.00
10            $40.00        -12         144          $-5.00      $25.00       $60.00
20            $40.00        -2          4            $-5.00      $25.00       $10.00
20            $50.00        -2          4            $5.00       $25.00       $-10.00
20            $30.00        -2          4            $-15.00     $225.00      $30.00
30            $70.00        8           64           $25.00      $625.00      $200.00

Means: x̄ = 22, ȳ = $45.00 (the deviations in each column sum to 0)
Sums: S_xx = 760, S_yy = $1,850.00, S_xy = $900.00

\[ b = \frac{S_{xy}}{S_{xx}} = \frac{900}{760} = 1.184 \approx 1.18 \]
\[ a = \bar{y} - b\bar{x} = 45 - 1.184\,(22) = 18.95 \]
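The same estimates can be obtained programmatically. This is a minimal sketch with the standard library only; `fit_simple_linear_regression` is a helper name introduced here for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) with b = Sxy / Sxx and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Data from the example table.
calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = fit_simple_linear_regression(calls, sales)
print(f"a = {a:.2f}, b = {b:.2f}")
```

The intercept is computed from the unrounded slope; rounding b to 1.18 before computing a would give 19.04 rather than 18.95.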
INTERPRETATION OF THE MODEL
Linear regression equation:
\[ \text{number of sales} = 18.95 + 1.18\,(\text{number of calls}) \]
a = 18.95 is the estimated average value of Y when the value of X is zero
However, x = 0 is not in the range of the sample values → a should not be used to estimate the
number of products sold
→ The number of calls ranged from 10 to 40, so estimates should be limited to that range
The value a = 18.95 can also be described as the portion of the number of products sold not
explained by the number of calls
→ Probably explained by other variables
b = 1.18 is the estimated change in the average value of Y as a result of a
one-unit change in X
→ It tells us that the number of sales increases by
1.18 units, on average, for each additional call
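As an illustration of both interpretation points (the slope as a per-call change, and the restriction to the sampled range), the fitted equation can be wrapped in a small function. This is a sketch: `predict_sales` and the choice to raise an error outside 10 to 40 calls are assumptions made for the example, and the coefficients are the rounded values from the equation.

```python
# Rounded coefficients from the fitted equation.
A = 18.95  # intercept
B = 1.18   # slope

# Range of the number of calls observed in the sample.
X_MIN, X_MAX = 10, 40


def predict_sales(calls):
    """Estimated average number of sales for a given number of calls."""
    if not X_MIN <= calls <= X_MAX:
        # Illustrative design choice: refuse to extrapolate beyond the sample.
        raise ValueError(f"calls must be between {X_MIN} and {X_MAX}")
    return A + B * calls


pred_25 = predict_sales(25)
pred_26 = predict_sales(26)

# One additional call changes the estimate by the slope.
increase_per_call = pred_26 - pred_25

print(pred_25, pred_26, increase_per_call)
```

Calling `predict_sales(0)` raises a `ValueError`, mirroring the point that the intercept should not be read as a sales estimate for zero calls.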
End