1 (B) - Data Science Correlation and Linear Regression - Cied
Correlation

Coefficient of Correlation
The population coefficient of correlation is denoted ρ (rho).
We estimate its value from sample data with the sample coefficient of correlation:

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = .815$$

n = 12, then df = 12 - 2 = 10

$$t = r\sqrt{\frac{n - 2}{1 - r^2}} = .815\sqrt{\frac{12 - 2}{1 - .815^2}} = 4.4477$$

Probability less than the critical value (t_{1-α,ν})
ν     0.90    0.95    0.975   0.99    0.995   0.999
10    1.372   1.812   2.228   2.764   3.169   4.143

(1 - t Distribution( :Column 1, 10 )) * 2
P-VALUE < ALPHA
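The test statistic and the two-tailed p-value formula can be checked with a short script. This is a minimal sketch, not the course's own tooling: the helper names `t_pdf` and `t_cdf` are made up for illustration, and the cumulative probability is approximated with Simpson's rule so that only the standard library is needed.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t): Simpson's rule on [0, |t|] plus symmetry about 0.

    steps must be even for Simpson's rule.
    """
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

# Values taken from the slide: r = .815 and n = 12.
r, n = 0.815, 12
df = n - 2
t_stat = r * math.sqrt(df / (1 - r ** 2))

# Same structure as the slide's formula: (1 - t Distribution(t, 10)) * 2
p_two_tailed = (1 - t_cdf(t_stat, df)) * 2

print(f"t = {t_stat:.4f}, df = {df}, two-tailed p = {p_two_tailed:.4f}")
```

As a sanity check, `t_cdf(2.228, 10)` should come out close to 0.975, matching the table row for ν = 10.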
DISCUSSION CASE 1
Consumer Reports' Quality Rating and Pricing
Consumers often wonder if paying more for one particular brand over another gives them a better-quality product.
Among the most useful sources of information for comparing brand names of specific products are the various consumers'
associations.
In a recent issue of Consumer Reports, 18 top-of-the-line stereo large-screen TV sets (mostly 25-inchers, and the rest 26-
and 27-inchers) were compared; the results are shown in the accompanying table. Notice that the various brands are ranked in
order of estimated overall quality - that is, by the overall ratings score. The ratings table in the article also states that
differences of less than 7 points in the overall ratings score - and differences of less than 10 points in the tone quality/accuracy
score - were judged not very significant. Furthermore, the article concluded that all the sets tested had acceptable picture
quality, the factor that was used as the chief criterion in rating the TVs.
On the basis of the available data, can you conclude that "you get what you pay for"?
Assignment Questions:
1. Find the correlation for the following pairs of variables:
2. Test each correlation in question 1 to determine if "you get what you paid for." Use α = .05.
SOLUTION
Analysis and Solution
For example, in question 1, to determine whether quality and price are positively related, we need to
test the correlation. We will use the actual price Consumers Union paid and calculate its correlation
with overall quality and with tone quality.
Overall Quality
H0: ρ = 0
HA: ρ > 0
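Because the case has n = 18 sets and a one-tailed alternative at α = .05, the decision rule can be stated as a minimum sample correlation. The sketch below makes one assumption that is not in the slides: the critical value 1.746 (upper 5% point of t with 16 degrees of freedom) is taken from a standard t table. The function names are illustrative only.

```python
import math

def t_from_r(r, n):
    """Test statistic for H0: rho = 0, computed from a sample correlation r."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def r_from_t(t, n):
    """Inverse of t_from_r: the correlation that produces a given t value."""
    return t / math.sqrt(t ** 2 + (n - 2))

n = 18          # number of TV sets in the case
T_CRIT = 1.746  # ASSUMPTION: t table value for a .05 upper tail, df = 16

# Smallest sample correlation that rejects H0 in favour of HA: rho > 0.
r_min = r_from_t(T_CRIT, n)

def reject_h0(r, n=n, t_crit=T_CRIT):
    """One-tailed decision: reject H0 when t exceeds the critical value."""
    return t_from_r(r, n) > t_crit

print(f"Reject H0 when r > {r_min:.3f}")
```

With these inputs the threshold works out to roughly r = 0.40, so a price-quality correlation below that would not support "you get what you pay for" at the 5% level.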
CORRELATIONS AND P‐VALUE
Effect of outliers on Pearson correlation

x value   y value      x value   y value
1         4            1         4
1         1            1         2
2         3            2         3
2         3            2         3
3         1            3         1
3         4            3         4
4         3            4         3
4         2            4         2
4         3            10        10

[Figure: scatter plot of the data]
Anscombes Data Set.jmp
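The effect of the single outlying point can be reproduced from the two columns of data on the slide. This is a small sketch using only the standard library; `pearson_r` is an illustrative helper, and the labels "a" and "b" for the left and right data sets are ours.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Left-hand data set from the slide (no outlier).
x_a = [1, 1, 2, 2, 3, 3, 4, 4, 4]
y_a = [4, 1, 3, 3, 1, 4, 3, 2, 3]

# Right-hand data set: the last point is replaced by (10, 10).
x_b = [1, 1, 2, 2, 3, 3, 4, 4, 10]
y_b = [4, 2, 3, 3, 1, 4, 3, 2, 10]

r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)
print(f"r without outlier: {r_a:.2f}")
print(f"r with outlier:    {r_b:.2f}")
```

The first set has essentially zero correlation, while the second comes out near 0.81, driven almost entirely by the point (10, 10).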
Anscombe’s Data Set
Observation x1 y1 x2 y2 x3 y3 x4 y4
1 10 8.04 10 9.14 10 7.46 8 6.58
2 8 6.95 8 8.14 8 6.77 8 5.76
3 13 7.58 13 8.74 13 12.74 8 7.71
4 9 8.81 9 8.77 9 7.11 8 8.84
5 11 8.33 11 9.26 11 7.81 8 8.47
6 14 9.96 14 8.1 14 8.84 8 7.04
7 6 7.24 6 6.13 6 6.08 8 5.25
8 4 4.26 4 3.1 4 5.39 19 12.5
9 12 10.84 12 9.13 12 8.15 8 5.56
10 7 4.82 7 7.26 7 6.42 8 7.91
11 5 5.68 5 4.74 5 5.73 8 6.89
Summary Statistics
N 11 11 11 11 11 11 11 11
mean 9.00 7.50 9.00 7.50091 9.00 7.50 9.00 7.50
SD 3.16 1.94 3.16 1.94 3.16 1.94 3.16 1.94
r 0.82 0.82 0.82 0.82
F.J. Anscombe, "Graphs in Statistical Analysis," American Statistician, vol. 27 (Feb 1973), pp. 17-21
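The summary rows can be recomputed from the table. The sketch below assumes the SD row uses the population formula (divide by N), since that is what reproduces 3.16 and 1.94; the sample formula (divide by N - 1) would give slightly larger values. The dictionary layout and names are illustrative.

```python
import math
from statistics import mean, pstdev

def pearson_r(xs, ys):
    """Sample Pearson correlation from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# x1, x2 and x3 share the same values in Anscombe's table.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "x1-y1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "x2-y2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "x3-y3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "x4-y4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One tuple per pair: (mean x, mean y, SD x, SD y, r).
summary = {
    name: (mean(xs), mean(ys), pstdev(xs), pstdev(ys), pearson_r(xs, ys))
    for name, (xs, ys) in anscombe.items()
}

for name, (mx, my, sx, sy, r) in summary.items():
    print(f"{name}: mean=({mx:.2f}, {my:.2f})  SD=({sx:.2f}, {sy:.2f})  r={r:.2f}")
```

All four pairs print the same line to two decimals, which is the point of the data set: identical summaries can hide very different scatter plots.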
Run regression
LINEST OUTPUT      x1-y1           x2-y2           x3-y3           x4-y4
SE(b1)  SE(b0)     0.12   1.12     0.12   1.13     0.12   1.12     0.12   1.12
Reg SS  SSR        27.51  13.76    27.50  13.78    27.47  13.76    27.49  13.74

[Figure: scatter plots of the Anscombe data sets with fitted regression lines; the fitted equations that survive are y = 0.4999x + 3.0017 and y = 0.4997x + 3.0025]
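A least-squares fit of the third and fourth pairs reproduces the two equations that survive from the figure. This is a minimal sketch; `fit_line` is an illustrative helper, and matching each equation to a pair is our inference from the computation, not something the slide states.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Third and fourth pairs from Anscombe's table.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

b0_3, b1_3 = fit_line(x3, y3)
b0_4, b1_4 = fit_line(x4, y4)
print(f"x3-y3: y = {b1_3:.4f}x + {b0_3:.4f}")
print(f"x4-y4: y = {b1_4:.4f}x + {b0_4:.4f}")
```

The fitted lines are nearly identical even though the third pair contains one outlier in y and the fourth pair gets its entire slope from the single point at x = 19.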
Spurious Correlation
BE CAREFUL WITH SMALL SAMPLES
https://www.nytimes.com/2018/03/06/upshot/texas-primary-democrats-voter-enthusiasm-turnout.html
Regression Models
H0: β1 = 0      H0: β1 = 0      H0: β1 = 0
H1: β1 ≠ 0      H1: β1 > 0      H1: β1 < 0

$$t = \frac{b_1 - \beta_1}{s_b}$$

where:

$$s_b = \frac{s_e}{\sqrt{SS_{XX}}}, \qquad s_e = \sqrt{\frac{SSE}{n - 2}}, \qquad SS_{XX} = \sum X^2 - \frac{(\sum X)^2}{n}$$

β1 = the hypothesized slope
df = n - 2

Source: Business Statistics: Contemporary Decision Making, by Ken Black. 6th edition
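The slope test can be traced step by step on the first Anscombe pair (x1, y1), which lets the result be compared with the SE(b1) value of 0.12 in the LINEST output. This is a sketch under stated assumptions: the function name and returned dictionary are illustrative, and the cross-product term `ss_xy` is the usual companion to SS_XX rather than something shown on the slide.

```python
import math

def slope_t_test(xs, ys, beta1_h0=0.0):
    """t test for the slope of a simple linear regression.

    Follows the slide's formulas: s_e = sqrt(SSE / (n - 2)),
    s_b = s_e / sqrt(SS_XX), t = (b1 - beta1) / s_b, df = n - 2.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))
    s_b = s_e / math.sqrt(ss_xx)
    return {"b1": b1, "s_e": s_e, "s_b": s_b,
            "t": (b1 - beta1_h0) / s_b, "df": n - 2}

# First pair from Anscombe's table.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

result = slope_t_test(x1, y1)
print({k: round(v, 4) for k, v in result.items()})
```

Passing a different `beta1_h0` (for example 1.0) tests the slope against a nonzero hypothesized value with the same formula.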
DISCUSSION CASE
Market Model of Stock Returns
A well-known model in finance, called the market model, assumes that the monthly rate of return
on a stock (R) is linearly related to the monthly rate of return on the overall stock market (Rm).
The mathematical description of the model is
R = β0 + β1Rm + ε
where the error term ε is assumed to satisfy the requirements of the linear regression model. For
practical purposes, Rm is taken to be the monthly rate of return on some major stock market
index, such as the New York Stock Exchange Composite Index.
The coefficient β1, called the stock's beta coefficient, measures how sensitive the stock's rate of
return is to changes in the level of the overall market. For example, if β1 > 1 (β1 < 1), the stock's
rate of return is more (less) sensitive to changes in the level of the overall market than is the
average stock.
The monthly rate of return for Host International Inc. stock and for the overall market (as
approximated by the NYSE Composite Index) over a 5-year period
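Constant   $\hat{\beta}_0$   $s_{\hat{\beta}_0}$   $\hat{\beta}_0 / s_{\hat{\beta}_0}$
X          $\hat{\beta}_1$   $s_{\hat{\beta}_1}$   $\hat{\beta}_1 / s_{\hat{\beta}_1}$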
Analysis of Variance
Source DF SS MS F
Regression 1 SSR MSR=SSR/1 MSR/MSE
Residual Error n-2 SSE MSE=SSE/(n-2)
Total n-1 SSyy
Testing the Overall Model
F-test
H0: β1 = 0
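The entries of the analysis of variance table and the F statistic can be computed directly from the data. The sketch below uses the first Anscombe pair so the sums of squares can be compared with the LINEST figures (27.51 and 13.76); the function name and dictionary keys are illustrative, and SSR here means the regression sum of squares, as in the table above.

```python
def anova_simple_regression(xs, ys):
    """ANOVA quantities for a simple linear regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - mx) ** 2 for x in xs)
    ss_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_yy = sum((y - my) ** 2 for y in ys)  # total sum of squares
    b1 = ss_xy / ss_xx
    ssr = b1 * ss_xy                        # regression sum of squares
    sse = ss_yy - ssr                       # residual (error) sum of squares
    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SSyy": ss_yy,
            "MSR": msr, "MSE": mse, "F": msr / mse,
            "df_regression": 1, "df_residual": n - 2, "df_total": n - 1}

# First pair from Anscombe's table.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

table = anova_simple_regression(x1, y1)
print(f"SSR = {table['SSR']:.2f}, SSE = {table['SSE']:.2f}, F = {table['F']:.2f}")
```

In simple regression the F statistic equals the square of the slope's t statistic, so the F-test of H0: β1 = 0 and the two-tailed t test of the slope always agree.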
Questions
• (more sensitive)
• α = .05
• df = 60 - 2 = 58
• Critical Value(s):
Reject at α = .05
Questions