Week 03 Regression

22-08-2024

TOD 533
Correlation, Introduction to Regression
Amit Das
TODS / AMSOM / AU
[email protected]

Association between interval variables


• Do two interval variables “move together” ?
• When one takes on “high” values (relative to its mean),
what does the other do?
• Pearson correlation coefficient

r = Σ(Zx · Zy) / N,  −1 ≤ r ≤ +1
• When high (low) z-scores of the two variables co-occur, the
correlation coefficient is larger


Computing the correlation coefficient


Student   Task 1 raw score   Task 1 z-score   Task 2 raw score   Task 2 z-score   Product of z-scores
1 42 +1.78 90 +1.21 +2.15
2 9 -1.04 40 -1.65 +1.72
3 28 +0.58 92 +1.33 +0.77
4 11 -0.87 50 -1.08 +0.94
5 8 -1.13 49 -1.13 +1.28
6 15 -0.53 63 -0.33 +0.17
7 14 -0.62 68 -0.05 +0.03
8 25 +0.33 75 +0.35 +0.12
9 40 +1.61 89 +1.16 +1.87
10 20 -0.10 72 +0.18 -0.02
SUM 212 0 688 0 +9.03
MEAN 21.2 0 68.8 0 +0.903
STD. DEV. 11.69 1 17.47 1
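The arithmetic in the table can be reproduced in a few lines (a sketch using NumPy; variable names are mine):

```python
import numpy as np

# Raw scores of the 10 students on the two tasks (from the table above)
task1 = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20])
task2 = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72])

# z-scores with the population standard deviation (ddof=0), as the table uses
z1 = (task1 - task1.mean()) / task1.std(ddof=0)
z2 = (task2 - task2.mean()) / task2.std(ddof=0)

# r is the mean of the products of paired z-scores
r = (z1 * z2).mean()
print(round(r, 3))  # 0.903
```

`np.corrcoef(task1, task2)[0, 1]` gives the same value: r is unchanged whether z-scores use N or N−1, as long as the choice is consistent.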

Eyeballing correlation


Statistical significance of r
• Null hypothesis: r = 0
• Compute test statistic t = r √(n − 2) / √(1 − r²)
• Compare against t-distribution with df = n-2

• For r = 0.903 with n = 10,


• test statistic = 5.94, compared against the t-distribution with df = 8
• p-value (2-tailed) = 0.0003 << 0.05
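This test can be checked directly (a sketch; SciPy's `stats.t.sf` gives the upper-tail probability):

```python
import math
from scipy import stats

r, n = 0.903, 10
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(t_stat, df=n - 2)  # two-tailed p-value

print(round(t_stat, 2))  # 5.94
print(p_value < 0.05)    # True
```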

Correlation and sample size


• Significance of r depends on sample size
• for larger n, smaller value of r might be significant

Sample size   r required at 10% (two-tailed)   r required at 5% (two-tailed)
12 0.497 0.576
22 0.360 0.423
32 0.296 0.349
42 0.257 0.304
52 0.231 0.273
102 0.164 0.195

• for very large n, a very small r might be significant


• statistical vs. managerial significance
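The thresholds in the table come from inverting the t-test: setting t to its critical value and solving for r gives r = t / √(t² + n − 2). A sketch (SciPy assumed; the function name is mine):

```python
import math
from scipy import stats

def critical_r(n, alpha):
    """Smallest |r| reaching two-tailed significance at level alpha (df = n - 2)."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / math.sqrt(t_crit**2 + (n - 2))

for n in (12, 22, 32, 42, 52, 102):
    print(n, round(critical_r(n, 0.10), 3), round(critical_r(n, 0.05), 3))
```

The first printed row, `12 0.497 0.576`, matches the table above.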


Association between ordinal variables


• The Spearman rank correlation coefficient

rs = 1 − 6 Σd² / [n(n² − 1)]
• where d is the difference in the ranks of a given individual for the two
variables
• suitable for ordinal data
• less affected than Pearson r by outliers

Rank Correlation example


Student   Task 1 raw score   Rank 1   Task 2 raw score   Rank 2   (Difference in ranks)²
1 42 1 90 2 1
2 9 9 40 10 1
3 28 3 92 1 4
4 11 8 50 8 0
5 8 10 49 9 1
6 15 6 63 7 1
7 14 7 68 6 1
8 25 4 75 4 0
9 40 2 89 3 1
10 20 5 72 5 0

• Spearman rank correlation coefficient


rs = 1 − (6 × 10) / [10 × (10² − 1)] = 1 − 60/990 ≈ 0.94 (94%)
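SciPy's `spearmanr` applied to the raw scores reproduces this; there are no tied scores here, so it agrees exactly with the d² formula:

```python
from scipy import stats

task1 = [42, 9, 28, 11, 8, 15, 14, 25, 40, 20]
task2 = [90, 40, 92, 50, 49, 63, 68, 75, 89, 72]

rs, p_value = stats.spearmanr(task1, task2)
print(round(rs, 3))  # 0.939
```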


Correlation and regression …1


• Earlier, we examined whether two interval-scaled variables are
associated (“move together”) using the correlation coefficient
−1 ≤ r ≤ +1
• linear regression frames the same question in a slightly different form
• by modeling the dependent variable Y as a linear function of the independent
variable X
Y = a + bX

The linear regression model


[Scatterplot: price in dollars (Y) against area in square feet (X), with the fitted line's slope b = p/q (rise p over run q) and intercept a marked]

Relation of apartment prices to floor area (hypothetical)


The best-fit regression line


• More than one line can be passed through the cloud (“scatterplot”) of
Y on X
• each line denotes a combination of a and b
• For each line
• for each data point compute error = Yobs – Ypred
• square the errors and add them up: Σe²
• The best-fit (least-squares) regression line

Y = A + BX (note A, B in caps) minimizes Σe²

Solution to minimization problem

• For the mathematically inclined, here’s how A and B (optimum values


of a and b) may be calculated:

B = (N ΣXY − ΣX ΣY) / (N ΣX² − (ΣX)²)

A = Ȳ − B X̄
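Applying these formulas to the student data from earlier (a sketch; treating the Task 1 score as X and the Task 2 score as Y), with a cross-check against `np.polyfit`:

```python
import numpy as np

x = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20], dtype=float)   # Task 1 as X
y = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72], dtype=float)  # Task 2 as Y
n = len(x)

# Normal-equation formulas from the slide
B = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x * x).sum() - x.sum()**2)
A = y.mean() - B * x.mean()

# np.polyfit solves the same least-squares problem
B2, A2 = np.polyfit(x, y, deg=1)

print(round(B, 3), round(A, 2))  # 1.351 40.17
```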


Interpreting the slope

[Three scatterplots of Y against X, one per sign of the slope]

• B > 0: larger values of X are associated with larger values of Y
• B = 0: the value of Y does not depend on X; the best estimate of Y is simply its mean
• B < 0: larger values of X are associated with smaller values of Y

Scale Invariance (or not)


• Let us say that, for area measured in square feet, the slope B of the
best-fit regression line is 500
• If we measure area in square meters, the value of B would work out
to be 5382
• Is that a problem?
• $500 per square foot vs $5382 per square meter?
• we can standardize all X and Y values before we start … then regression
coefficient B is scale-free
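A quick numerical check with synthetic data (all numbers below are invented for illustration): converting X from square feet to square meters multiplies the slope by 10.7639, while the correlation is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
area_sqft = rng.uniform(500, 2000, size=50)               # hypothetical areas
price = 500 * area_sqft + rng.normal(0, 50_000, size=50)  # hypothetical prices

SQFT_PER_SQM = 10.7639
area_sqm = area_sqft / SQFT_PER_SQM

B_sqft = np.polyfit(area_sqft, price, deg=1)[0]  # roughly $500 per sq ft
B_sqm = np.polyfit(area_sqm, price, deg=1)[0]    # roughly $5382 per sq m

r_sqft = np.corrcoef(area_sqft, price)[0, 1]
r_sqm = np.corrcoef(area_sqm, price)[0, 1]

print(round(B_sqm / B_sqft, 4))    # 10.7639 -- the slope rescales with the unit
print(abs(r_sqft - r_sqm) < 1e-9)  # True -- the correlation does not
```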


Correlation and regression …2


• The correlation coefficient r and the regression slope B
are related as follows:
r = B (SX / SY)
• where SX and SY are the standard deviations of X and Y
respectively
• r also has the benefit of being scale-invariant
• it does not matter whether area is measured in square feet or
square meters, or whether price is measured in INR or USD

Standardized regression coefficients


• Recall that regression coefficients are not scale-invariant
• i.e. they depend on the units of measurement
• To get scale-invariant coefficients
• standardize Y as well as X1, X2, …, Xn, estimate
zY = C + D1 zX1 + D2 zX2 + … + Dn zXn
• the z-score of Y is modeled as a function of the z-scores of Xi … the
coefficients Di are scale-invariant
• Also used when the relative magnitudes of Xi differ
widely (in their “natural” units)
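In the one-predictor case the standardized slope D reduces to the correlation r, which gives a quick check (a sketch reusing the Task 1 / Task 2 scores from earlier):

```python
import numpy as np

x = np.array([42, 9, 28, 11, 8, 15, 14, 25, 40, 20], dtype=float)
y = np.array([90, 40, 92, 50, 49, 63, 68, 75, 89, 72], dtype=float)

zx = (x - x.mean()) / x.std(ddof=0)
zy = (y - y.mean()) / y.std(ddof=0)

D, C = np.polyfit(zx, zy, deg=1)   # slope and intercept of zY on zX
r = np.corrcoef(x, y)[0, 1]

print(round(D, 3), round(r, 3))  # both 0.903
print(abs(C) < 1e-9)             # True: the intercept is (numerically) zero
```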


Generalizing to multiple regression


• How does Y vary with the levels of multiple
“explanatory” variables?
Y = A + B1X1 + B2X2 + … + BnXn
• Bi is the slope of Y on dimension Xi
• B1, B2, …, Bn called “partial” regression coefficients
• the magnitudes (and even signs) of B1, B2, …, Bn depend on which
other variables are included in the multiple regression model
• might not agree in magnitude (or even sign) with the bivariate
correlation coefficient r between Xi and Y
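The sign-flip caveat is easy to demonstrate with synthetic data (all values invented for illustration): below, Y depends negatively on X2 once X1 is controlled for, yet the bivariate correlation of X2 with Y comes out positive because X2 tracks X1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)           # X2 correlated with X1
y = 2.0 + 1.0 * x1 - 0.5 * x2 + rng.normal(size=n)

# Least-squares fit of Y = A + B1 X1 + B2 X2 via the design matrix
X = np.column_stack([np.ones(n), x1, x2])
(A, B1, B2), *_ = np.linalg.lstsq(X, y, rcond=None)

r_x2_y = np.corrcoef(x2, y)[0, 1]
print(B2 < 0, r_x2_y > 0)  # partial slope negative, bivariate correlation positive
```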

Predictive power
• R = bivariate correlation between Yobserved and Ypredicted
(how well do they agree?)
• Consider the proportionate reduction in prediction
error (PRE) using the model

PRE = [ Σ(Yobs − Ȳ)² − Σ(Yobs − Ypred)² ] / Σ(Yobs − Ȳ)²

• relative to the baseline of predicting Y using just its mean Ȳ


• turns out that PRE = R2
• R2 or R-square measures the predictive power of the
multiple regression model
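Both routes to R² can be verified numerically (a sketch with synthetic data; for a least-squares fit with an intercept the identity PRE = R² holds exactly):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

# PRE: reduction in squared error relative to predicting Y by its mean
sse_mean = ((y - y.mean()) ** 2).sum()
sse_model = ((y - y_pred) ** 2).sum()
pre = (sse_mean - sse_model) / sse_mean

R = np.corrcoef(y, y_pred)[0, 1]  # correlation of observed and predicted Y
print(abs(pre - R**2) < 1e-9)     # True: PRE equals R-squared
```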


Hypothesis-testing in regression
• Consider Y = A + B1X1 + B2X2 + …+ BnXn
• For the null hypothesis H0 that ALL the coefficients Bi
are zero: B1 = B2 = … = Bn = 0
• and the alternate hypothesis Ha that at least one Bi is
NOT zero: Bi ≠ 0
• the test statistic is

F = (R² / k) / [ (1 − R²) / (n − k − 1) ]

• k = number of explanatory variables Xi


• n = number of observations (sample size)

Overall F-test of model


• The test statistic is compared against the
F-distribution with df1 = k and df2 = n-(k+1)
• If the test statistic is large, the area to the right of this value will be
small
• small p-value enables rejection of the null hypothesis (H0: all Bi are zero)
• note that this is more likely if R2 is large
• A model that fails this test is no better than no model (in terms of
prediction error)


Significance of coefficients
• Whether each coefficient Bi differs significantly from
zero is tested using the test statistic Bi / SE(Bi)
(value of coefficient / standard error)
• compared against t-distribution with n-(k+1) df
• Each coefficient can be tested in this manner
• H0: coefficient is zero vs. Ha: coefficient is not zero
• When a coefficient Bi fails this test, it is not significantly
different from zero, and the term involving Xi can be
dropped from the model
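A sketch of the per-coefficient t-test using the usual OLS standard errors, the square roots of the diagonal of s²(XᵀX)⁻¹ (synthetic data; here X2 genuinely plays no role in generating Y):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, k = 100, 2
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)  # X2 has no effect on Y

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

s2 = (resid @ resid) / (n - k - 1)                  # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))  # standard errors
t_stats = coef / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)

print(p_values[1] < 0.05)  # X1 is significant
print(p_values[2])         # X2 usually is not (its true coefficient is zero)
```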

Desirable properties of regression model


• High R2
• indicates that a large proportion of the variation in Y is explained by the
independent variables
• Significant F-test
• the null hypothesis that all Bi are zero can be conclusively rejected
• Significant coefficients (t-test)
• change in each explanatory variable significantly affects the level of the
dependent variable


Another example: Boston housing prices


Variables
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

Excerpt of Boston housing data


crim zn indus chas nox ptratio b lstat medv
0.00632 18 2.31 0 0.538 15.3 396.9 4.98 24
0.02731 0 7.07 0 0.469 17.8 396.9 9.14 21.6
0.02729 0 7.07 0 0.469 17.8 392.83 4.03 34.7
0.03237 0 2.18 0 0.458 18.7 394.63 2.94 33.4
0.06905 0 2.18 0 0.458 18.7 396.9 5.33 36.2
0.02985 0 2.18 0 0.458 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 15.2 395.6 12.43 22.9
0.14455 12.5 7.87 0 0.524 15.2 396.9 19.15 27.1
0.21124 12.5 7.87 0 0.524 15.2 386.63 29.93 16.5


Boston housing regression model

Boston housing: Regression model predictions


crim zn indus chas nox ptratio b lstat medv Predicted values Residuals
0.00632 18 2.31 0 0.538 15.3 396.9 4.98 24 30.0 -6.00
0.02731 0 7.07 0 0.469 17.8 396.9 9.14 21.6 25.0 -3.43
0.02729 0 7.07 0 0.469 17.8 392.83 4.03 34.7 30.6 4.13
0.03237 0 2.18 0 0.458 18.7 394.63 2.94 33.4 28.6 4.79
0.06905 0 2.18 0 0.458 18.7 396.9 5.33 36.2 27.9 8.26
0.02985 0 2.18 0 0.458 18.7 394.12 5.21 28.7 25.3 3.44
0.08829 12.5 7.87 0 0.524 15.2 395.6 12.43 22.9 23.0 -0.10
0.14455 12.5 7.87 0 0.524 15.2 396.9 19.15 27.1 19.5 7.56
0.21124 12.5 7.87 0 0.524 15.2 386.63 29.93 16.5 11.5 4.98

• Negative residuals (actual − predicted): the actual price is below the model's prediction → underpriced? → good value?
• Positive residuals: the actual price is above the model's prediction → overpriced?
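The residual column can be recomputed from the medv and predicted columns of the excerpt (the slide shows predicted values rounded to one decimal place, so these residuals can differ slightly in the second decimal):

```python
# medv (actual) and model-predicted values copied from the excerpt above
medv      = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5]
predicted = [30.0, 25.0, 30.6, 28.6, 27.9, 25.3, 23.0, 19.5, 11.5]

# residual = actual - predicted; negative means priced below the model's prediction
residuals = [round(a - p, 2) for a, p in zip(medv, predicted)]
below_prediction = [i + 1 for i, r in enumerate(residuals) if r < 0]

print(residuals[0])      # -6.0
print(below_prediction)  # rows of the excerpt priced below prediction
```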


Getting carried away … the story of Zillow
