
1 (B) - Data Science Correlation and Linear Regression - Cied

The document reviews correlation and linear regression, emphasizing the coefficient of correlation (r) as a measure of linear relationships between variables. It discusses the assumptions required for correlation analysis and provides examples of testing correlations in consumer product quality and pricing. Additionally, it touches on regression models and hypothesis testing for regression slopes, highlighting the importance of understanding spurious correlations.

Uploaded by

Javier Jana
Copyright
© © All Rights Reserved

DATA SCIENCE

Review on Correlation and Linear Regression


1. Correlation is a measure of the degree of relatedness of variables.

2. The coefficient of correlation (r) is applicable only if both variables being analyzed have at least an interval level of data.

3. It measures only linear relationships.
Degrees of Correlation
Source: Business Statistics:
Contemporary Decision Making, by
Ken Black. 6th edition
Coefficient of Correlation
We can also use the Pearson coefficient of correlation to test for a linear relationship
between two variables.
Recall:
The coefficient of correlation ranges between –1 and +1. The sample coefficient is r; the population coefficient is ρ (rho).
• If r = –1 (negative association) or r = +1 (positive association), every point falls on the regression line.
• If r = 0, there is no linear pattern.

It requires the following data assumptions to hold:

1. interval or ratio level;
2. linearly related;
3. bivariate normally distributed (for the test and the confidence interval).
(A test for normality requires a minimum of 15 data points.)

Correlation

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}} = .815 $$

INTEREST RATE FEDERAL FUNDS.jmp


Coefficient of Correlation
The population coefficient of correlation is denoted ρ (rho).

We estimate its value from sample data with the sample coefficient of correlation, r. Here r = .815.

The test statistic for testing H0: ρ = 0 is:

$$ t = r \sqrt{\frac{n - 2}{1 - r^2}} $$

which is Student t-distributed with n – 2 degrees of freedom.

With n = 12, df = 12 – 2 = 10:

$$ t = .815 \sqrt{\frac{12 - 2}{1 - .815^2}} = 4.4477 $$

Probability less than the critical value ($t_{1-\alpha,\nu}$):

ν     0.90    0.95    0.975   0.99    0.995   0.999
10    1.372   1.812   2.228   2.764   3.169   4.143

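H0: no correlation (ρ = 0)

Two-sided p-value (JMP formula, with the t statistic in Column 1 and 10 degrees of freedom):

(1 - t Distribution( :Column 1, 10 )) * 2

Here the p-value is less than α, so H0 is rejected.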
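The test above can be retraced with a few lines of Python. This is a minimal sketch using only the standard library: it recomputes the t statistic from r = .815 and n = 12 and compares it with the two-sided critical value 2.228 listed for ν = 10 in the t table (the 0.975 column, which assumes α = .05). The function name is illustrative, not from the slides.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: rho = 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = 0.815, 12                 # sample correlation and sample size from the example
t_stat = correlation_t_statistic(r, n)

# Two-sided test at alpha = .05 uses the 0.975 quantile for nu = 10 (from the t table).
t_crit = 2.228
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.4f}, df = {n - 2}, reject H0: {reject_h0}")
```

Comparing against the tabulated critical value avoids needing a t-distribution routine; an exact p-value would require a statistics package.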
DISCUSSION CASE 1
Consumer Reports' Quality Rating and Pricing

Consumers often wonder if paying more for one particular brand over another gives them a better-quality product.
Among the most useful sources of information for comparing brand names of specific products are the various consumers'
associations.
In a recent issue of Consumer Reports, 18 top-of-the-line stereo large-screen TV sets (mostly 25-inchers, and the rest 26-
and 27-inchers) were compared; the results are shown in the accompanying table. Notice that the various brands are ranked in
order of estimated overall quality - that is, by the overall ratings score. The ratings table in the article also states that
differences of less than 7 points in the overall ratings score - and differences of less than 10 points in the tone quality/accuracy
score - were judged not very significant. Furthermore, the article concluded that all the sets tested had acceptable picture
quality, the factor that was used as the chief criterion in rating the TVs.
On the basis of the available data, can you conclude that "you get what you pay for"?

Data file: CASE Quality and Price.jmp

Assignment Questions:

1. Find the correlation for the following pairs of variables:

a. overall ratings score and manufacturer's suggested list price
b. overall ratings score and price Consumers Union paid
c. tone quality/accuracy score and manufacturer's list price
d. tone quality/accuracy score and price Consumers Union paid

2. Test each correlation in question 1 to determine if "you get what you paid for." Use α = .05.
SOLUTION
Analysis and Solution
For example, for question 1: to determine whether quality and price are positively related, we need to test
the correlation. We will use the actual price Consumers Union paid and calculate its correlation with overall
quality and with tone quality.
Overall Quality
H0: ρ = 0
HA: ρ > 0
CORRELATIONS AND P‐VALUE
Effect of outliers on Pearson correlation

Data set 1             Data set 2
x value   y value      x value   y value
1         4            1         4
1         1            1         2
2         3            2         3
2         3            2         3
3         1            3         1
3         4            3         4
4         3            4         3
4         2            4         2
4         3            10        10

Pearson R = 0.0000     Pearson R = 0.812324

(Scatter plots of the two data sets.)
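The effect of the outlier can be checked directly from the two small data sets in the outlier table. The sketch below is standard-library Python; the helper name `pearson_r` is an illustrative choice. It applies the usual Pearson formula to each data set and should agree with the reported values (0.0000 and 0.812324) up to rounding.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the product of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# First data set: no linear pattern.
x_a = [1, 1, 2, 2, 3, 3, 4, 4, 4]
y_a = [4, 1, 3, 3, 1, 4, 3, 2, 3]

# Second data set: same cloud of points, but the last point is the outlier (10, 10).
x_b = [1, 1, 2, 2, 3, 3, 4, 4, 10]
y_b = [4, 2, 3, 3, 1, 4, 3, 2, 10]

r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)
print(f"r without outlier: {r_a:.4f}")
print(f"r with outlier:    {r_b:.6f}")
```

A single extreme point is enough to move r from essentially zero to above 0.8, which is why plotting the data before interpreting r matters.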
Anscombes Data Set.jmp

Anscombe’s Data Set
Observation x1 y1 x2 y2 x3 y3 x4 y4
1 10 8.04 10 9.14 10 7.46 8 6.58
2 8 6.95 8 8.14 8 6.77 8 5.76
3 13 7.58 13 8.74 13 12.74 8 7.71
4 9 8.81 9 8.77 9 7.11 8 8.84
5 11 8.33 11 9.26 11 7.81 8 8.47
6 14 9.96 14 8.1 14 8.84 8 7.04
7 6 7.24 6 6.13 6 6.08 8 5.25
8 4 4.26 4 3.1 4 5.39 19 12.5
9 12 10.84 12 9.13 12 8.15 8 5.56
10 7 4.82 7 7.26 7 6.42 8 7.91
11 5 5.68 5 4.74 5 5.73 8 6.89

Summary Statistics
        x1     y1     x2     y2        x3     y3     x4     y4
N       11     11     11     11        11     11     11     11
mean    9.00   7.50   9.00   7.50091   9.00   7.50   9.00   7.50
SD      3.16   1.94   3.16   1.94      3.16   1.94   3.16   1.94
r       0.82 (x1, y1)   0.82 (x2, y2)   0.82 (x3, y3)   0.82 (x4, y4)
F.J. Anscombe, "Graphs in Statistical Analysis," American Statistician, vol. 27 (Feb 1973), pp. 17-21
Run regression

LINEST output
Statistic      x1-y1    x2-y2    x3-y3    x4-y4
slope          0.50     0.50     0.50     0.50
intercept      3        3        3        3
SE(b1)         0.12     0.12     0.12     0.12
SE(b0)         1.12     1.13     1.12     1.12
R2             0.67     0.67     0.67     0.67
RMSE           1.24     1.24     1.24     1.24
F              17.99    17.97    17.97    18.00
df             9        9        9        9
Reg SS         27.51    27.50    27.47    27.49
Residual SS    13.76    13.78    13.76    13.74

(Bravais) Pearson correlation of x1 and y1 = 0.816
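The near-identical regression results for the four Anscombe data sets can be reproduced from the data table. The sketch below is standard-library Python with an illustrative helper `fit_line`; it computes the least-squares slope, intercept and R² for each pair and prints them side by side.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept and R^2 for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
    return b1, b0, r_squared

# Anscombe's data as listed in the table (x1, x2 and x3 share the same values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]

fits = {
    "x1-y1": fit_line(x123, y1),
    "x2-y2": fit_line(x123, y2),
    "x3-y3": fit_line(x123, y3),
    "x4-y4": fit_line(x4, y4),
}
for name, (b1, b0, r2) in fits.items():
    print(f"{name}: slope = {b1:.4f}, intercept = {b0:.4f}, R^2 = {r2:.2f}")
```

The numbers agree across all four sets even though the scatter plots look completely different, which is the point of Anscombe's example.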


Scatter plots with fitted regression lines (four panels):

x1-y1 data: y = 0.5001x + 3.0001
x2-y2 data: y = 0.5x + 3.0009
x3-y3 data: y = 0.4997x + 3.0025
x4-y4 data: y = 0.4999x + 3.0017
Spurious Correlation

• In business practice, the idea of spurious correlation is taken to mean roughly that when two variables correlate, it is not because one is a direct cause of the other but rather because both are brought about by a third variable.
• This situation presents a major interpretative challenge to business practice, a challenge that is heightened by the difficulty of disentangling the various concepts associated with the idea of spurious correlation.
• When the third variable is time, it induces unnecessary collinearity: two series that both trend over time will correlate with each other.
Source: http://tylervigen.com/spurious-correlations

spurious correlation bees.jmp


How Much Can Primaries Predict the General Election?

Be careful with small samples.

https://www.nytimes.com/2018/03/06/upshot/texas-primary-democrats-voter-enthusiasm-turnout.html
Regression Models

• Deterministic Regression Model: produces an exact output:

$$ y = \beta_0 + \beta_1 x $$

• Probabilistic Regression Model: adds an error term ε:

$$ y = \beta_0 + \beta_1 x + \epsilon $$

• $\beta_0$ and $\beta_1$ are population parameters.

• $\beta_0$ and $\beta_1$ are estimated by sample statistics $b_0$ and $b_1$.


airlines example.jmp

Hypothesis Tests for the Slope
of the Regression Model

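Hypotheses for testing whether the slope differs from zero:

Two-tailed: H0: $\beta_1 = 0$, H1: $\beta_1 \neq 0$
Upper-tailed: H0: $\beta_1 = 0$, H1: $\beta_1 > 0$
Lower-tailed: H0: $\beta_1 = 0$, H1: $\beta_1 < 0$

The same three forms apply with any hypothesized slope B in place of 0 (H0: $\beta_1 = B$).

Test statistic:

$$ t = \frac{b_1 - \beta_1}{s_b} $$

where:

$$ s_b = \frac{s_e}{\sqrt{SS_{XX}}} $$

$$ s_e = \sqrt{\frac{SSE}{n - 2}} $$

$$ SS_{XX} = \sum X^2 - \frac{\left(\sum X\right)^2}{n} $$

$\beta_1$ = the hypothesized slope

$df = n - 2$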
DISCUSSION CASE
Market Model of Stock Returns
A well-known model in finance, called the market model, assumes that the monthly rate of return
on a stock (R) is linearly related to the monthly rate of return on the overall stock market (Rm).
The mathematical description of the model is

R = β0 + β1Rm + ε

where the error term ε is assumed to satisfy the requirements of the linear regression model. For
practical purposes, Rm is taken to be the monthly rate of return on some major stock market
index, such as the New York Stock Exchange Composite Index.

The coefficient β1, called the stock's beta coefficient, measures how sensitive the stock's rate of
return is to changes in the level of the overall market. For example, if β1 > 1 (β1 < 1), the stock's
rate of return is more (less) sensitive to changes in the level of the overall market than is the
average stock.
The monthly rates of return for Host International Inc. stock and for the overall market (as
approximated by the NYSE Composite Index) over a 5-year period are given in the data file:

market model stock return-DISCUSSION CASE DISCUSSION.jmp


THE DATA REGRESSION

R = β0 + β1Rm + ε
Host = β0 + β1 Index + ε

The regression equation is

Y = constant + slope * X

Predictor   Coef              SE Coef               T                                      P
Constant    $\hat{\beta}_0$   $s_{\hat{\beta}_0}$   $\hat{\beta}_0 / s_{\hat{\beta}_0}$    p-value
X           $\hat{\beta}_1$   $s_{\hat{\beta}_1}$   $\hat{\beta}_1 / s_{\hat{\beta}_1}$    p-value

S = √[SSE/(n-2)]     R-Sq = SSR/SSyy

Analysis of Variance
Source           DF     SS      MS                F
Regression       1      SSR     MSR = SSR/1       MSR/MSE
Residual Error   n-2    SSE     MSE = SSE/(n-2)
Total            n-1    SSyy
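To see how each entry of the output template is obtained, the sketch below fills in the coefficient table and the ANOVA table for a small data set. It is standard-library Python; the function name and the six (x, y) points are hypothetical and chosen only for illustration, not taken from the Host International data.

```python
import math

def simple_regression_summary(xs, ys):
    """Fill in the coefficient and ANOVA entries for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_yy = sum((y - y_bar) ** 2 for y in ys)          # Total SS

    b1 = ss_xy / ss_xx                                  # slope
    b0 = y_bar - b1 * x_bar                             # constant

    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # Residual Error SS
    ssr = ss_yy - sse                                   # Regression SS
    mse = sse / (n - 2)
    s = math.sqrt(mse)                                  # S = sqrt(SSE / (n - 2))

    se_b1 = s / math.sqrt(ss_xx)
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / ss_xx)

    return {
        "b0": b0, "se_b0": se_b0, "t_b0": b0 / se_b0,
        "b1": b1, "se_b1": se_b1, "t_b1": b1 / se_b1,
        "S": s, "R-Sq": ssr / ss_yy,
        "SSR": ssr, "SSE": sse, "SSyy": ss_yy,
        "MSR": ssr / 1, "MSE": mse, "F": (ssr / 1) / mse,
    }

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
summary = simple_regression_summary(xs, ys)
for key, value in summary.items():
    print(f"{key:>6}: {value:.4f}")
```

The p-value column is left out because it needs a t-distribution routine, which the standard library does not provide.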
Testing the Overall Model
F-test

• It is common in regression analysis to compute an F test to determine the overall significance of the model.
• In multiple regression, this test determines whether at least one of the regression
coefficients (from multiple predictors) is different from zero.
• Simple regression provides only one predictor and only one regression coefficient
to test.
• Because the regression coefficient is the slope of the regression line, the F test for
overall significance tests the same thing as the t test of the slope in simple regression.

H0: the model is useless (β1 = 0)
H1: the model is somehow useful (β1 ≠ 0)
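The equivalence of the two tests in simple regression can be written out in a few lines. This short derivation assumes the standard identity that the regression sum of squares equals $b_1^2 SS_{XX}$ when there is one predictor, together with the definitions $MSE = s_e^2$ and $s_b = s_e / \sqrt{SS_{XX}}$:

```latex
\begin{align*}
F &= \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)} \\
  &= \frac{b_1^2 \, SS_{XX}}{s_e^2} \\
  &= \left( \frac{b_1}{s_e / \sqrt{SS_{XX}}} \right)^2
   = \left( \frac{b_1 - 0}{s_b} \right)^2 = t^2
\end{align*}
```

So the F statistic with (1, n – 2) degrees of freedom is the square of the t statistic for H0: β1 = 0, and the two tests give the same two-sided p-value.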
Questions

• a. Is Host International more sensitive than average to overall stock market movements?

• b. Estimate (with 95% confidence) next month's expected rate of return for Host International stock, given that the corresponding expected rate of return for the overall market is x = 0.5%.

• c. What proportion of the variability of the return for Host International is explained by overall stock market movements?
Hypothesis Tests for the Slope
of the Regression Model
• As the slope of the regression line diverges from zero, the
regression model is adding predictability that the ȳ line (the average of y) is not
generating.
• Testing the slope of the regression line to determine
whether the slope is different from zero is important.
• If the slope is not different from zero, the regression line is
doing nothing more than the average line of y predicting y.
• You can also test whether the slope differs from any other
number.
Is Host International more sensitive than average to overall stock
market movements?

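• H0: $\beta_1 = 1$
• Ha: $\beta_1 > 1$ (more sensitive)
• α = .05
• df = 60 - 2 = 58
• Test statistic:

$$ t = \frac{\hat{\beta}_1 - 1}{s_{\hat{\beta}_1}} = \frac{1.6007 - 1}{0.2756} = 2.17608 $$

(value as reported on the slide; the rounded inputs shown give about 2.18)

• Critical value(s): $t_{.05,\,58}$ (one-tailed)
• Decision: reject H0 at α = .05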
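The slope test for Host International can be retraced numerically. The sketch below is standard-library Python with an illustrative function name. It uses the rounded estimate 1.6007 and standard error 0.2756 shown on the slide, so it gives roughly 2.18 rather than the reported 2.17608. The one-tailed critical value 1.672 for 58 degrees of freedom is an assumption taken from a standard t table; it is not stated in the source.

```python
def slope_t_statistic(b1: float, hypothesized_slope: float, se_b1: float) -> float:
    """t statistic for H0: beta1 = hypothesized_slope."""
    return (b1 - hypothesized_slope) / se_b1

b1, se_b1, n = 1.6007, 0.2756, 60     # rounded values from the slide, 60 monthly returns
df = n - 2
t_stat = slope_t_statistic(b1, 1.0, se_b1)

# Assumption: upper-tail critical value t(.05, 58) from a standard t table.
t_crit = 1.672
reject_h0 = t_stat > t_crit           # upper-tailed test, Ha: beta1 > 1

print(f"t = {t_stat:.4f}, df = {df}, reject H0: {reject_h0}")
```

Rejecting H0 supports the conclusion that the stock's beta exceeds 1, that is, that Host International is more sensitive than average to market movements.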