Week 03 Regression

22-08-2024

TOD 533
Correlation, Introduction to Regression
Amit Das
TODS / AMSOM / AU
[email protected]

Association between interval variables


• Do two interval variables “move together” ?
• When one takes on “high” values (relative to its mean),
what does the other do?
• Pearson correlation coefficient

r = Σ(Zx · Zy) / N,  −1 ≤ r ≤ +1
• When high (low) z-scores of the two variables co-occur, the
correlation coefficient is larger


Computing the correlation coefficient


Student   Task 1 raw score   Task 1 z-score   Task 2 raw score   Task 2 z-score   Product of z-scores
1 42 +1.78 90 +1.21 +2.15
2 9 -1.04 40 -1.65 +1.72
3 28 +0.58 92 +1.33 +0.77
4 11 -0.87 50 -1.08 +0.94
5 8 -1.13 49 -1.13 +1.28
6 15 -0.53 63 -0.33 +0.17
7 14 -0.62 68 -0.05 +0.03
8 25 +0.33 75 +0.35 +0.12
9 40 +1.61 89 +1.16 +1.87
10 20 -0.10 72 +0.18 -0.02
SUM 212 0 688 0 +9.03
MEAN 21.2 0 68.8 0 +0.903
STD. DEV. 11.69 1 17.47 1
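The arithmetic in the table can be reproduced in a few lines (a sketch using NumPy; variable names are mine):

```python
import numpy as np

# Raw scores of the 10 students on the two tasks (from the table above)
task1 = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20])
task2 = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72])

# z-scores with the population standard deviation (ddof=0), as the table uses
z1 = (task1 - task1.mean()) / task1.std(ddof=0)
z2 = (task2 - task2.mean()) / task2.std(ddof=0)

# r is the mean of the products of paired z-scores
r = (z1 * z2).mean()
print(round(r, 3))  # 0.903
```

`np.corrcoef(task1, task2)[0, 1]` gives the same value: r is unchanged whether z-scores use N or N−1, as long as the choice is consistent.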

Eyeballing correlation


Statistical significance of r
• Null hypothesis: r = 0
• Compute test statistic t = r √(n − 2) / √(1 − r²)
• Compare against t-distribution with df = n-2

• For r = 0.903 with n = 10,


• test statistic = 5.94, compared against the t-distribution with df = 8
• p-value (2-tailed) = 0.0003 << 0.05
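This test can be checked directly (a sketch; SciPy's `stats.t.sf` gives the upper-tail probability):

```python
import math
from scipy import stats

r, n = 0.903, 10
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(t_stat, df=n - 2)  # two-tailed p-value

print(round(t_stat, 2))  # 5.94
print(p_value < 0.05)    # True
```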

Correlation and sample size


• Significance of r depends on sample size
• for larger n, smaller value of r might be significant

Sample size   r required at 10% (two-tailed)   r required at 5% (two-tailed)
12 0.497 0.576
22 0.360 0.423
32 0.296 0.349
42 0.257 0.304
52 0.231 0.273
102 0.164 0.195

• for very large n, a very small r might be significant


• statistical vs. managerial significance
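The thresholds in the table come from inverting the t-test: setting t to its critical value and solving for r gives r = t / √(t² + n − 2). A sketch (SciPy assumed; the function name is mine):

```python
import math
from scipy import stats

def critical_r(n, alpha):
    """Smallest |r| reaching two-tailed significance at level alpha (df = n - 2)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / math.sqrt(t_crit**2 + (n - 2))

for n in (12, 22, 32, 42, 52, 102):
    print(n, round(critical_r(n, 0.10), 3), round(critical_r(n, 0.05), 3))
```

The first printed row, `12 0.497 0.576`, matches the table above.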


Association between ordinal variables


• The Spearman rank correlation coefficient

rs = 1 − 6 Σd² / [n(n² − 1)]
• where d is the difference in the ranks of a given individual for the two
variables
• suitable for ordinal data
• less affected than Pearson r by outliers

Rank Correlation example


Student   Task 1 raw score   Rank 1   Task 2 raw score   Rank 2   (Difference in ranks)²
1 42 1 90 2 1
2 9 9 40 10 1
3 28 3 92 1 4
4 11 8 50 8 0
5 8 10 49 9 1
6 15 6 63 7 1
7 14 7 68 6 1
8 25 4 75 4 0
9 40 2 89 3 1
10 20 5 72 5 0

• Spearman rank correlation coefficient


rs = 1 − (6 × 10) / [10 × (10² − 1)] = 1 − 60/990 ≈ 0.94 (94%)
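SciPy's `spearmanr` applied to the raw scores reproduces this; there are no tied scores here, so it agrees exactly with the d² formula:

```python
from scipy import stats

task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]

rs, p_value = stats.spearmanr(task1, task2)
print(round(rs, 3))  # 0.939
```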


Correlation and regression …1


• Earlier, we examined whether two interval-scaled variables are
associated (“move together”) using the correlation coefficient
−1 ≤ r ≤ +1
• linear regression frames the same question in a slightly different form
• by modeling the dependent variable Y as a linear function of the independent
variable X
Y = a + bX

The linear regression model


[Scatterplot: price in dollars (Y) against area in square feet (X), with the fitted line's slope b = p/q (rise p over run q) and intercept a marked]

Relation of apartment prices to floor area (hypothetical)


The best-fit regression line


• More than one line can be passed through the cloud (“scatterplot”) of
Y on X
• each line denotes a combination of a and b
• For each line
• for each data point compute error = Yobs – Ypred
• square the errors and add them up: Σe²
• The best-fit (least-squares) regression line

Y = A + BX (note A, B in caps) minimizes Σe²

Solution to minimization problem

• For the mathematically inclined, here’s how A and B (optimum values


of a and b) may be calculated:

B = (N ΣXY − ΣX ΣY) / (N ΣX² − (ΣX)²)

A = Ȳ − B X̄
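Applying these formulas to the student data from earlier (a sketch; treating the Task 1 score as X and the Task 2 score as Y), with a cross-check against `np.polyfit`:

```python
import numpy as np

x = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20], dtype=float)   # Task 1 as X
y = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72], dtype=float)  # Task 2 as Y
n = len(x)

# Normal-equation formulas from the slide
B = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x * x).sum() - x.sum()**2)
A = y.mean() - B * x.mean()

# np.polyfit solves the same least-squares problem
B2, A2 = np.polyfit(x, y, deg=1)

print(round(B, 3), round(A, 2))  # 1.351 40.17
```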


Interpreting the slope

[Three scatterplots of Y against X, one per sign of the slope]

• B > 0: larger values of X are associated with larger values of Y
• B = 0: the value of Y does not depend on X; the best estimate of Y is simply its mean
• B < 0: larger values of X are associated with smaller values of Y

Scale Invariance (or not)


• Let us say that, for area measured in square feet, the slope B of the
best-fit regression line is 500
• If we measure area in square meters, the value of B would work out
to be 5382
• Is that a problem?
• $500 per square foot vs $5382 per square meter?
• we can standardize all X and Y values before we start … then regression
coefficient B is scale-free
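A quick numerical check with synthetic data (all numbers below are invented for illustration): converting X from square feet to square meters multiplies the slope by 10.7639, while the correlation is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
area_sqft = rng.uniform(500, 2000, size=50)               # hypothetical areas
price = 500 * area_sqft + rng.normal(0, 50_000, size=50)  # hypothetical prices

SQFT_PER_SQM = 10.7639
area_sqm = area_sqft / SQFT_PER_SQM

B_sqft = np.polyfit(area_sqft, price, deg=1)[0]  # roughly $500 per sq ft
B_sqm = np.polyfit(area_sqm, price, deg=1)[0]    # roughly $5382 per sq m

r_sqft = np.corrcoef(area_sqft, price)[0, 1]
r_sqm = np.corrcoef(area_sqm, price)[0, 1]

print(round(B_sqm / B_sqft, 4))    # 10.7639 -- the slope rescales with the unit
print(abs(r_sqft - r_sqm) < 1e-9)  # True -- the correlation does not
```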


Correlation and regression …2


• The correlation coefficient r and the regression slope B
are related as follows:
r = B (SX / SY)
• where SX and SY are the standard deviations of X and Y
respectively
• r also has the benefit of being scale-invariant
• it does not matter whether area is measured in square feet or
square meters, or whether price is measured in INR or USD

Standardized regression coefficients


• Recall that regression coefficients are not scale-invariant
• i.e. they depend on the units of measurement
• To get scale-invariant coefficients
• standardize Y as well as X1, X2, …, Xn, estimate
zY = C + D1 zX1 + D2 zX2 + … + Dn zXn
• the z-score of Y is modeled as a function of the z-scores of Xi … the
coefficients Di are scale-invariant
• Also used when the relative magnitudes of Xi differ
widely (in their “natural” units)
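In the one-predictor case the standardized slope D reduces to the correlation r, which gives a quick check (a sketch reusing the Task 1 / Task 2 scores from earlier):

```python
import numpy as np

x = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20], dtype=float)
y = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72], dtype=float)

zx = (x - x.mean()) / x.std(ddof=0)
zy = (y - y.mean()) / y.std(ddof=0)

D, C = np.polyfit(zx, zy, deg=1)   # slope and intercept of zY on zX
r = np.corrcoef(x, y)[0, 1]

print(round(D, 3), round(r, 3))  # both 0.903
print(abs(C) < 1e-9)             # True: the intercept is (numerically) zero
```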


Generalizing to multiple regression


• How does Y vary with the levels of multiple
“explanatory” variables?
Y = A + B1X1 + B2X2 + … + BnXn
• Bi is the slope of Y on dimension Xi
• B1, B2, …, Bn called “partial” regression coefficients
• the magnitudes (and even signs) of B1, B2, …, Bn depend on which
other variables are included in the multiple regression model
• might not agree in magnitude (or even sign) with the bivariate
correlation coefficient r between Xi and Y
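The sign-flip caveat is easy to demonstrate with synthetic data (all values invented for illustration): below, Y depends negatively on X2 once X1 is controlled for, yet the bivariate correlation of X2 with Y comes out positive because X2 tracks X1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)           # X2 correlated with X1
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Least-squares fit of Y = A + B1 X1 + B2 X2 via the design matrix
X = np.column_stack([np.ones(n), x1, x2])
(A, B1, B2), *_ = np.linalg.lstsq(X, y, rcond=None)

r_x2_y = np.corrcoef(x2, y)[0, 1]
print(B2 < 0, r_x2_y > 0)  # partial slope negative, bivariate correlation positive
```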

Predictive power
• R = bivariate correlation between Yobserved and Ypredicted
(how well do they agree?)
• Consider the proportionate reduction in prediction
error (PRE) using the model

PRE = [ Σ(Yobs − Ȳ)² − Σ(Yobs − Ypred)² ] / Σ(Yobs − Ȳ)²

• relative to the baseline of predicting Y using just its mean Ȳ


• turns out that PRE = R2
• R2 or R-square measures the predictive power of the
multiple regression model
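Both routes to R² can be verified numerically (a sketch with synthetic data; for a least-squares fit with an intercept the identity PRE = R² holds exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

# PRE: reduction in squared error relative to predicting Y by its mean
sse_mean = ((y - y.mean()) ** 2).sum()
sse_model = ((y - y_pred) ** 2).sum()
pre = (sse_mean - sse_model) / sse_mean

R = np.corrcoef(y, y_pred)[0, 1]  # correlation of observed and predicted Y
print(abs(pre - R**2) < 1e-9)     # True: PRE equals R-squared
```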


Hypothesis-testing in regression
• Consider Y = A + B1X1 + B2X2 + …+ BnXn
• For the null hypothesis H0 that ALL the coefficients Bi
are zero: B1 = B2 = … = Bn = 0
• and the alternate hypothesis Ha that at least one Bi is
NOT zero: Bi ≠ 0
• the test statistic is

F = (R² / k) / [ (1 − R²) / (n − k − 1) ]

• k = number of explanatory variables Xi


• n = number of observations (sample size)

Overall F-test of model


• The test statistic is compared against the
F-distribution with df1 = k and df2 = n-(k+1)
• If the test statistic is large, the area to the right of this value will be
small
• small p-value enables rejection of the null hypothesis (H0: all Bi are zero)
• note that this is more likely if R2 is large
• A model that fails this test is no better than no model (in terms of
prediction error)


Significance of coefficients
• Whether each coefficient Bi differs significantly from
zero is tested using the test statistic Bi / SE(Bi)
(value of coefficient / standard error)
• compared against t-distribution with n-(k+1) df
• Each coefficient can be tested in this manner
• H0: coefficient is zero vs. Ha: coefficient is not zero
• When a coefficient Bi fails this test, it is not significantly
different from zero, and the term involving Xi can be
dropped from the model
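A sketch of the per-coefficient t-test using the usual OLS standard errors, the square roots of the diagonal of s²(XᵀX)⁻¹ (synthetic data; here X2 genuinely plays no role in generating Y):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 100, 2
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # X2 has no effect on Y

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

s2 = (resid @ resid) / (n - k - 1)                  # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # standard errors
t_stats = coef / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)

print(p_values[1] < 0.05)  # X1 is significant
print(p_values[2])         # X2 usually is not (its true coefficient is zero)
```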

Desirable properties of regression model


• High R2
• indicates that a large proportion of the variation in Y is explained by the
independent variables
• Significant F-test
• the null hypothesis that all Bi are zero can be conclusively rejected
• Significant coefficients (t-test)
• change in each explanatory variable significantly affects the level of the
dependent variable


Another example: Boston housing prices


Variables
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

Excerpt of Boston housing data


crim zn indus chas nox ptratio b lstat medv
0.00632 18 2.31 0 0.538 15.3 396.9 4.98 24
0.02731 0 7.07 0 0.469 17.8 396.9 9.14 21.6
0.02729 0 7.07 0 0.469 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 18.7 396.9 5.33 36.2
0.02985 0 2.18 0 0.458 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 15.2 395.6 12.43 22.9
0.14455 12.5 7.87 0 0.524 15.2 396.9 19.15 27.1
0.21124 12.5 7.87 0 0.524 15.2 386.63 29.93 16.5


Boston housing regression model

Boston housing: Regression model predictions


crim zn indus chas nox ptratio b lstat medv Predicted values Residuals
0.00632 18 2.31 0 0.538 15.3 396.9 4.98 24 30.0 -6.00
0.02731 0 7.07 0 0.469 17.8 396.9 9.14 21.6 25.0 -3.43
0.02729 0 7.07 0 0.469 17.8 392.83 4.03 34.7 30.6 4.13
0.03237 0 2.18 0 0.458 18.7 394.63 2.94 33.4 28.6 4.79
0.06905 0 2.18 0 0.458 18.7 396.9 5.33 36.2 27.9 8.26
0.02985 0 2.18 0 0.458 18.7 394.12 5.21 28.7 25.3 3.44
0.08829 12.5 7.87 0 0.524 15.2 395.6 12.43 22.9 23.0 -0.10
0.14455 12.5 7.87 0 0.524 15.2 396.9 19.15 27.1 19.5 7.56
0.21124 12.5 7.87 0 0.524 15.2 386.63 29.93 16.5 11.5 4.98

• Negative residuals (actual − predicted): the actual price is below the model's prediction → underpriced? → good value?
• Positive residuals: the actual price is above the model's prediction → overpriced?
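The residual column can be recomputed from the medv and predicted columns of the excerpt (the slide shows predicted values rounded to one decimal place, so these residuals can differ slightly in the second decimal):

```python
# medv (actual) and model-predicted values copied from the excerpt above
medv      = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]
predicted = [30.0, 25.0, 30.6, 28.6, 27.9, 25.3, 23.0, 19.5, 11.5]

# residual = actual - predicted; negative means priced below the model's prediction
residuals = [round(a - p, 2) for a, p in zip(medv, predicted)]
below_prediction = [i + 1 for i, r in enumerate(residuals) if r < 0]

print(residuals[0])      # -6.0
print(below_prediction)  # rows of the excerpt priced below prediction
```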


Getting carried away … the story of Zillow
